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ABSTRACT 


This thesis is part of an ongoing larger seale researeh study started in 2004 at the 
Naval Postgraduate Sehool (NPS) whieh aims to develop a speeeh-driven human- 
maehine interfaee for the operation of semi-autonomous military robots in noisy 
operational environments. Earlier work ineluded eolleeting a small database of isolated 
word utteranees of seven words from 20 adult subjeets using an in-ear mierophone. The 
researeh eondueted here develops a speaker-independent isolated word reeognizer from 
these aeoustie signals based on a diserete-observation Hidden Markov Model (HMM). 

The study implements the HMM-based isolated word reeognizer in three steps. 
The first step performs the endpoint deteetion and speeeh segmentation by using short¬ 
term temporal analysis. The seeond step ineludes speeeh feature extraetion using statie 
and dynamie MFCC parameters and veetor quantization of eontinuous-valued speeeh 
features. Finally, the last step involves the diserete-observation HMM-based elassifier for 
isolated word reeognition. Experimental results show the average elassifieation 
performanee around 92.77%. The most signifieant result of this study is that the aeoustie 
signals originating from speeeh organs and eolleeted within the external ear eanal via the 
in-ear mierophone ean be used for isolated word reeognition. 

The seeond dataset eolleeted under low signal-to-noise ratio eonditions with 
additive noise results in 79% reeognition aeouraey in the HMM-based elassifier. We also 
eompared the elassifieation results of the data eolleeted within the ear eanal and outside 
the mouth via the same mierophone. Average elassifieation rates obtained for the data 
eolleeted outside the mouth shows signifieant performanee degradation (down to 63%), 
over that observed with the data eolleeted from within the ear eanal (down to 86%). 
Reeall that the ear eanal dampens high frequeneies. As a result, the HMM model derived 
for the data with dampened higher frequeneies does not aeeurately fit the data eolleeted 
outside the mouth, resulting in degraded reeognition performanees. 
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EXECUTIVE SUMMARY 


The performance of modem speech recognizers may turn out to be poor under 
adverse conditions, especially when classifiers are trained under high signal-to-noise ratio 
(SNR) environments like noise-free chambers (typically where SNR>30dB) and 
operated in real-world surroundings of relatively lower SNR. Consequently, many 
researchers have recently focused on the task of building more noise-robust recognizers 
that may be operated in noisy environments (within a car or a noisy flight cabin) with 
higher accuracy. 

This thesis study is part of an ongoing larger scale research started in 2004 at the 
Naval Postgraduate School which aims to develop a speech-driven human-machine 
interface command and control package applicable for the operation of semi-autonomous 
military robots in noisy operational environments. Earlier related research by 
Vaidyanathan [Vaidyanathan, 2004b] showed that the in-ear microphone device selected 
was sensitive enough to make use of tongue movements for a human-machine interface, 
which led to considering its extension to speech signals for isolated word recognition. 

This thesis study extended the earlier work by Newton [Newton, 2006] who 
collected a database of isolated word utterances of seven words from 20 adult subjects by 
using the in-ear microphone. The research conducted here develops a speaker- 
independent isolated word recognizer based on these acoustic signals and a discrete- 
observation Hidden Markov Model (HMM). Hence, the objectives of this research were 
to: 

• Develop a MATLAB simulation for a simple yet complete isolated word 
Hidden Markov Model (HMM) recognizer, 

• Investigate the feasibility of using the acoustic speech signals collected 
within the external ear canal via the in-ear microphone for isolated word 
recognition, 

• Explore the performance and noise-robustness of the in-ear microphone 
data for isolated word recognition in noisy operational environments. 


XIX 



The speech database of this study included seven isolated words {down, up, right, 
left, pan, move, kill}, each repeated 50 times by 20 adult subjects and recorded with an 8 
kHz sampling frequency. 

The study began by investigating the characteristics of the in-ear microphone data 
by spectrogram analysis. The in-ear microphone data can be considered a low-pass 
filtered version of the speech signals emitted in front of the mouth. Spectrogram analysis 
showed that most of the signal energy was gathered below 2.5 kHz in the acoustic signals 
of the isolated word utterances. Mean correction was first applied to the in-ear 
microphone data to eliminate the DC bias of the microphone and then a high-pass filter 
was used to remove the noise distortion collected around 100 Hz based on the 
spectrogram analysis. After high-pass filtering, the acoustic signal is split into frames for 
short-term analysis. The most crucial part of pre-processing of the in-ear microphone data 
is the front-end detection to locate the speech boundaries and segment the acoustic 
signals of the isolated words from the non-speech or silence portion. This task becomes 
harder when bodily-created noises like gulps, coughs, tongue clicks or lip smacks occur 
during the in-ear microphone recordings. Non-speech events or noise may complicate the 
endpoint detection of speech waveform especially when dealing with words starting or 
ending with low energy phonemes (such as the weak fricative /f/, weak plosive burst /p/ 
or /t/) or nasals (such as /n/ at the end). 

Two endpoint detection algorithms were implemented in MATLAB to determine 
the beginning and the termination of speech in a given isolated word utterance. The first 
segmentation algorithm combined two short-term speech signal analysis. The short-term 
energy (STE) measure was used to estimate initial speech boundaries while a second 
search of the zero-crossing rate (ZCR) was used to refine these boundaries. The second 
speech segmentation algorithm studied was based on the frame-based modified Teager 
energy. However, simulations showed this detection approach required longer processing 
time than the first algorithm due to its computationally-expensive search algorithm. In 
addition, the results showed the two algorithms provided comparable detection of 
endpoint locations. As a result, the focus rested on the first approach that uses STE and 


XX 



ZCR to segment the in-ear mierophone database due to its more eomputationally-effieient 
algorithm. 

The purpose of speeeh feature extraetion is to eonvert speeeh waveform to some 
type of parametrie representation at a lower information rate for further analysis. Speeeh 
signals have non-stationary eharacteristies. As a result, the speeeh waveforms are 
eommonly split into small frames (typically 5 ms to 40 ms) in which the signal 
characteristics are considered quasi-stationary to allow for short-term spectral analysis 
and feature extraction. A wide range of parametric speech representations may be used to 
generate input feature to speech recognizers. The most commonly used speech features 
include the Linear Predictive Coding (LPC) Coefficients, Real Cepstral Coefficients 
(RCC), LPC-derived Cepstral Coefficients (LPC-CC) and Mel-frequency Cepstral 
Coefficients (MFCC). LPC-CC and MFCC parameters were considered in this study to 
determine the best feature set for the in-ear microphone database. MFCC parameters are 
usually preferred over others because they are less susceptible to speaker-dependent 
variations likely to be present in speech signals. Classification results showed that MFCC 
parameters outperformed the LPC-CC parameters in recognition accuracy for the data 
under investigation. Dynamic delta-MFCC parameters were also derived to augment the 
spectral representation of the static MFCC parameters with some temporal information. 

A discrete-symbol HMM (DHMM) with eight hidden states was selected as the 
classifier type because that type of classifier is simple to implement, has low 
computational load, and has been shown to perform well in isolated word recognition 
applications. We selected a codebook of length 128 which was generated by applying the 
K-means clustering algorithm to the training set of speech features. Next, vector 
quantization (VQ) was applied to map the continuous-valued speech features (12 MFCC 
and 12 delta-MFCC parameters) to a discrete set of codebook indices (or symbols), and 
quantized features used to estimate the optimum model parameters. 

Two thirds of the data were used for generating the codebook and training the 
DHMM. The remaining one third of the data were used for testing the performance of the 
isolated word HMM recognizer. The average classification rate for this study’s multiple- 
speaker isolated word HMM recognition system was 92.77%. Results show that the 



acoustic signals originating from speech organs and collected within the external ear 
eanal via an in-ear mierophone eould be effeetively used for isolated word reeognition. 

The seeond dataset eolleeted under low SNR eonditions with additive noise 
resulted in 79.24% reeognition aeouraey in the HMM-based elassifier. This result is 
eonsistent with the theoretieal noise-shielding property of the human ear, thus justifying 
the use of in-ear mierophone data for speech enhancement and improved reeognition 
under low-SNR eonditions. 

Finally, we investigated using the in-ear mierophone outside the mouth and 
eomparing elassifioation results obtained for the data eolleeted within the ear eanal and 
outside the mouth. Average elassification rates obtained for the data eolleeted outside the 
mouth shows significant performance degradation (down to 63%), over that observed 
with the data eolleeted from within the ear eanal (down to 86%). The ear eanal dampens 
high frequeneies. As a result, the HMM model derived for the data with dampened higher 
frequeneies does not aeeurately fit the data eolleeted outside the mouth, resulting in 
degraded reeognition performanees. 



I. 


INTRODUCTION 


This thesis study is a part of an ongoing larger scale research started in 2004 at 
NFS which aims to contribute to the development of a speech-driven human-machine 
interface command and control package applicable for the operation of semi-autonomous 
military robots in noisy environments. Towards this end, this research focused on 
developing an isolated word recognizer using Hidden Markov Models (HMM). The 
HMM classifier was trained on air pressure signals collected from within the ear-canal 
via an in-ear microphone because previous research results have shown these signals are 
less susceptible to environmental noise distortions than those collected in front of the 
mouth [Westerlund, 2003]. In a broad sense, the ultimate goal of Automatic Speech 
Recognition (ASR) is to build and train computers/machines or artificial intelligence that 
can receive and interpret spoken instructions and act upon them properly. 

Great progress in ASR has yielded many practical applications in recent years, 
such as user-friendly speech interfaces in control consoles of cars, credit card number 
recognition and the verbal selection of menus over the telephone. However, after 50 year¬ 
long research efforts and considerable advances in ASR notwithstanding, robust speech 
recognition for human-machine interface still remains a challenging problem today. The 
performance of the modern speech recognizers may turn out to be poor under adverse 
conditions, especially when classifiers are trained under high signal-to-noise ratio (SNR) 
environments like noise-free chambers (typically where SNR >30 dB) and operated in 
real-world surroundings of relatively lower SNR [Deller, 2000]. In contrast, a healthy 
human listener’s performance is usually far more stable on average under similar training 
and operating conditions. Unfortunately many researchers agree that human-quality, 
adaptively-learning and noise-robust machines that recognize and interpret human speech 
will not be achieved in the near future [Deller, 2000; Gold, 2000; Owens, 1993]. 
However, even incremental improvements leading toward this ultimate goal in ASR are 
of great importance. 

Next, a brief history of speech recognition technology is presented before 
addressing the difficulties in the ASR area. 
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A. INTRODUCTION TO AUTOMATIC SPEECH RECOGNITION (ASR) 

Speech is the natural and the fundamental way of communication for most 
humans. Technically speaking, Automatic Speech Recognition (ASR) refers to a 
mechanism (hardware and software combined) that stores some representations of 
distinguishing characteristics of speech with a source of input equipment, such as a 
microphone and further processes these representations to match them to incoming 
speech in an effort to interact with machines, computers and/or human users 

The first primitive recognizer was developed at Bell Labs during the early 1950s. 
However, it was the 1960s that brought many major breakthroughs to the field of ASR. 
Some of these achievements are noteworthy to mention herein because they did not only 
develop significant tools for speech recognition but also established the very basic 
concepts on which this thesis work is mainly based. Specifically, the development of the 
Fast Fourier Transform (FFT) by Cooley and Tukey in 1965 decreased the 
computational load of Discrete Fourier Transform (DFT) with a faster algorithm, thereby 
enabling the practical implementations of Digital Signal Processing (DSP) custom chips 
[Cooley, 1965; Gold, 2000]. Oppenheim, Schafer, and Stockham introduced Cepstral 
Analysis which performs deconvolution of the speech signal to separate an excitation 
sequence from an impulse response convolved with it [Oppenheim, 1968]. Cepstral 
coefficients and many derivatives have been widely used to represent the short-term 
spectral envelope of speech signals thus far. 

It was also the late 1960s and early 1970s that saw another useful method for 
speech analysis, known as Linear Predictive Coding (LPC). One of the earliest and 
complete papers on the application of linear prediction to speech analysis was published 
by Atal and Schroeder [Atal, 1971; Deller, 2000]. Basically, LPC uses a pole-only 
(autoregressive) filter to model the speech signal. LPC coefficients and its derivatives are 
extensively used for transmitting speech spectral envelope information [Gold, 2000]. 

Most notably, the foundations for the statistical technique of Flidden Markov 
Modeling, which models an observed sequence as produced by a sequence of hidden 
states, dates back to the 1960s as well. However, the first successful applications of 
Hidden Markov Modeling to speech recognition were realized in the 1970s [Gold, 2000]. 
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Baum and his colleagues developed a popular expectation-maximization (EM) algorithm, 
known as the Baum-Welch Re-estimation Algorithm (or Forward-Baekward Algorithm), 
to estimate the parameters of a Hidden Markov Model (HMM) iteratively [Baum, 1966; 
Baum, 1970]. Hidden Markov Models (HMM) and the Baum-Weleh Re-estimation 
Algorithm are widely used today in contemporary state-of-the-art speech recognition 
systems. 

Dynamic Time Warping (DTW), a deterministic alternative approach to the 
statistical HMM was also introduced in the 1970s. DWT normalizes the different-length 
utterances of the same word and applies template-based elassifieation to speech 
recognition. 

Many different approaches ineorporating DTW, HMM and Artifieial Neural 
Networks (ANN) were developed for speech recognition in the 1970s. Among these 
studies, the project of the Advanced Research Projects Agency (AREA), was a 
remarkable aehievement in that it performed a 1000-word ASR system by using 
connected speeeh from a few speakers with a word error rate of less than 10% [Gold, 
2000]. In the 1980s, the projeet of the Defense Advaneed Researeh Projeets Ageney 
(DARPA Projeet) and the major other programs conducted by Texas Instruments and the 
Massachusetts Institute of Technology (TIMIT Projeet) and the National Institute of 
Standards and Technology (NIST) primarily concentrated on the eolleetion of large 
eorpora used for training and testing speeeh recognizers. These large eorpora were 
subsequently used by the ASR researeh eommunity at large for performance comparison 
of different approaches applied to speech recognition. 

The ASR community witnessed some other important developments in the 1980s 
as well. Among those, the Mel-Cepstrum Analysis introdueed by Davis [Davis, 1980], 
and the Dynamic Cepstral Coefficients proposed by Furui [Furui, 1986] can be 
eonsidered remarkable techniques for speeeh feature-extraction due to significant 
improvement in reeognition accuracy. 

As for the speech recognizers of the 1980s, many researehers were experimenting 
with frame-based HMM reeognizers, ANN reeognizers or hybrid schemes eombining 
HMM and ANN in isolated and/or continuous contexts of speeeh [Gold, 2000]. Most 
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importantly, the contemporary speech recognition systems of today still use these 
methods predominantly. 

In the following decades, advances in computationally-efficient digital computing 
power and abundant memory made the inexpensive implementation of DSP chips and 
Field-Programmable Gate Arrays (FPGA) possible. Needless to say, this has been a 
major boon to the performance of ASR technology. Commercial ASR applications such 
as speech-to-text dictation software have been available and used extensively since the 
late 1990s. However, despite considerable progress, many aspects of ASR are still 
mysterious and problematic even today because the engineering view of human speech 
production and human ear perception is not yet fully conclusive. Hence, the research 
community believes that the field of speech recognition is still in its early infancy and 
will remain to pose a challenging problem for the near future. [Deller, 2000; Gold, 2000]. 

Next, the dimensions of difficulty involved in ASR are discussed because 
understanding the complexity of the ASR problem is the first step to practical and 
achievable solutions. 

B. DIMENSIONS OF DIFFICULTY IN ASR TECHNOLOGY FOR HUMAN - 

COMPUTER INTERACTIONS 

Waibel and Lee addressed the question of complexity involved in ASR in 
[Waibel, 1990] as ''dimensions of difficulty” These are the factors that determine the 
complexity and the specifications of an ASR system. Deller et al. summarizes some of 
these factors that render speech recognition methods complicated [Deller, 2000]. 

First, speaker-dependency plays a defining role in speech recognition systems. A 
speaker-dependent ASR system is trained and tested on the utterances of a specific 
speaker to learn the representations characterizing speech. Thus, this type of recognizer is 
designed specifically to recognize the speech of its trainer. Most commercially available 
speech-to-text dictation software mainly uses this mode because it needs to be trained by 
each specific user before operation. On the contrary, a speaker-independent system is 
trained on many speakers in an effort to make the system capable of recognizing the 
speech of a generalized population who may be outside of the training population. The 
models used in this type of recognizers are derived from expected characteristics and are 

not fine-tunable to the end users. Therefore, there is no need for further training when a 
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new speaker is introduced to a speaker-independent system. This operation mode is more 
convenient for a telephone application where users navigate through speech-activated 
menus. However, the convenience of speaker-independent systems may come with lower 
recognition accuracy than that obtained with speaker-dependent systems. 

Second, continuous speech recognition (CSR) systems may seem natural and 
user-friendly at first glance, but they require more sophisticated recognizers to handle 
word boundaries issues. Isolated word recognition (IWR) systems minimize this problem 
as they require a deliberate pause between each utterance to simplify the end-point 
detection algorithm needed to locate the beginning and the end of speech. As a result, 
IWR is expected to have higher recognition accuracy. Generally speaking, CSR systems 
require more CPU power and memory than ISR systems. Further, inter-speaker and intra¬ 
speaker variations of the training population in articulation, pronunciation and intonation 
make it even harder for CSR systems to determine speech boundaries, thus yielding lower 
classification accuracy compared to that obtained with ISR systems. 

Third, the vocabulary size is another key factor that affects the dimension of 
difficulties in recognition. Considering the number of ambiguous utterances (e.g., 
“knight” and “night”) and acoustic confusability (e.g., “beer” and “bear”) in the 
vocabulary of interest naturally, the performance of a particular recognizer is expected to 
degrade with the increasing vocabulary size [Deller, 2000]. Small vocabulary (less than 
100 words) recognizers can perform relatively simpler tasks such as destination sorting 
systems for shipping tasks, credit card number or telephone number recognition. In these 
examples, specific models for each word in the vocabulary can be stored in the system 
and recognition is achieved by an exhaustive search through the whole vocabulary. As 
the vocabularies become larger, the recognition task creates increasing memory 
requirements. When training and modeling for each word in a larger vocabulary becomes 
impractical, models of subword units like syllables and phonemes are preferred over 
models of words. 

Fourth, the waveform of a speech signal is very susceptible to the variations in 
channel, microphone characteristics, room reverberation or background noise. Nonlinear 
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effects of noise and channel distortion can be very destructive for recognition tasks, 
especially when no a-priori knowledge is available about their characteristics. 

Finally, another important parameter in ASR system performances is whether any 
linguistic information is built into the recognizer to fine-tune algorithms. Speech 
recognizers are trained on basic speech unit models such as phones, phonemes, syllables 
or words. Linguistic (or grammar) constraints deal with how these basic units should be 
concatenated to form a meaningful message in a particular language. Such linguistic 
information may be embedded into more sophisticated recognizers. 

All the above factors have a major impact on the success of a particular ASR 
system and add up to determine the necessary level of system complexity in design 
phase. Among all of the above-mentioned factors, performance degradation in noisy real- 
world environments is probably the most significant factor limiting the progress of ASR 
technology today. Consequently, many researchers have recently focused on the task of 
building more noise-robust recognizers that may be operated in noisy environments 
(within a car or a noisy flight cabin) with higher accuracy [Moon, 1997; Shozakai, 1998; 
Graciarena, 2003]. Lippmann et al. reported a direct comparison of human listeners and 
machine recognizers on a range of tasks in his speech recognition research [Lippman, 
1997; Gold, 2000]. His findings supported the fact that noise considerably degrades the 
performance of ASR systems even for simple tasks, whereas the superiority of human 
performance increases in noise and for more difficult speech contexts such as 
spontaneous speech unlike machine recognizers. His experimental results on human 
perception of distorted speech sets a good benchmark for the ultimate goal of more 
humanlike, noise-robust and adaptive ASR systems of the future [Gold, 2000]. 

Next, the main objectives are presented as well as the specifications of the 
experimental ASR system on which this thesis work is focused. 

C. THESIS OBJECTIVE AND EQUIPMENT USED 

The objective of this research was to design an isolated word HMM recognizer 
with a small vocabulary to evaluate the effectiveness of the acoustic speech signals 
collected within ear canal on speech recognition. This work uses a speech database 
collected earlier by Newton [Newton, 2006] which consists of isolated word utterances 
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from 20 adult subjects. Each subject repeated a set of seven isolated words {down, up, 
right, left, pan, move, kill} 50 times, recorded with an 8 kHz sampling frequency. 

The speech signals were collected via the in-ear microphone shown in Figure 1.1 
in an indoor office environment with medium SNR levels to simulate real-world 
conditions and potential operational environments. A second set of data were collected 
under additive noise in the same office environment with the in-ear microphone located 
in the ear first and then in front of the mouth to evaluate the noise shielding properties of 
the in-ear microphone. Figure 1.2 shows the whole equipment set-up used for database 
collection including the in-ear microphone, analog-to-digital (A/D) converter, PCMCIA 
card and a general purpose notebook computer. 



Figure 1.1. Foam-Encased in-Ear Microphone Device. 
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Figure 1.2. Equipment Components Used for Speech Data Collection. 


Next, the motivation behind this study is discussed and previous related research 
summarized. 

D. PREVIOUS RELATED RESEARCH 

As ASR technologies are transferred more and more to daily-life applications, the 
need for greater robustness to noisy environments is growing. A great deal of recent work 
has highlighted the need for robust speech detection, enhancement and recognition via 
the use of multiple-microphone arrays [Zhang, 2004] or more sophisticated adaptive 
noise canceling devices [Westerlund, 2003]. 

Westerlund et al. proposed a sophisticated approach to improve speech 
recognition where a Flidden Markov Model (FIMM) recognizer was used to evaluate the 
quality and intelligibility of speech signals recorded using an ear-microphone. One of the 
pieces of equipment used in his research was an Active Noise Control (ANC) headset 
equipped with an in-ear microphone and ear muffs. This work showed that the high-pass 
filtering property of the passive absorbers (ear-muffs) is convenient for noise reduction 
when combining an ANC headset with an in-ear microphone since the speech signal in 
the external auditory canal is a low-pass filtered version of the speech signal in front of 
the mouth. Consequently, Westerlund reported that this combination of passive and 
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active noise reduction methods with the use of an ear-mierophone had considerable 
impaet on speeeh recognition performances, resulting in a 50% recognition rate inerease 
in a severely noisy environment. 

The idea that a mierophone can be placed on other loeations of the body rather 
than just in front of the mouth was also proposed by other researchers [Zhang, 2004; 
Graciarena, 2003]. Zhang et al. developed a headset prototype that integrates several 
heterogeneous sensors including a close-talk and a throat microphone shown in Figure 
1.3 for robust speech detection and recognition, achieving promising results. 



Figure 1.3. Throat and Air-Conductive Microphone (From [Zhang, 2004].). 

Graciarena et al. also developed a method that combines standard microphone and 
throat microphone features in noisy continuous speech recognition experiments. This 
researeh provided signifieant word error rate reduction compared to that obtained with a 
single microphone [Graciarena, 2003]. Although all the above-mentioned research 
reported eonsiderable robustness and improvement on speech recognition, the 
implementations used were either intrusive or involved rather complicated schemes in 
terms of computational load. 

Vaidyanathan et al. recently introduced a unique signal processing strategy which 
maps the air flow signals generated from tongue movements collected within the ear 
canal to specific commands [Vaidyanathan, 2004a]. The in-ear mierophone used in this 
work is simpler in design and more eonvenient to wear when eompared to more 
eomplieated and intrusive alternatives. The microphone earpiece eontains a small passive 
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sensor, which is commercially available off-the-shelf and is similar to those commonly 
found in hearing aids. This work showed that various tongue movements within the oral 
cavity create unique and traceable pressure changes in the human ear, which can be 
captured via an ear-microphone and classified accurately. These results were applied to 
the hands-free tele-operation of a mobile robot named “Whegs II hexapod robot. ” This 
research also reported that 

...Conventional signal processing techniques are generally inadequate to 
recognize the subtle pressure variations in the ear canal resulting from 
tongue movement. The ear canal itself is an interference-ridden, noise- 
amplified environment for acoustic recording. External noise 
(environmental sounds) can also easily obscure the slight pressure 
deviations accompanying tongue movement [Vaidyanathan, 2004b]. 

This earlier research study showed that the in-ear microphone device selected was 
sensitive enough to collect tongue movements, which led to considering its extension to 
speech signals. The next section presents the organization of this thesis work. 

E. THESIS ORGANIZATION 

This initial chapter of the thesis provided an introduction to ASR technology and 
focused on the main challenges in the ASR field so as to present the fundamentals of 
ASR. Then the chapter introduced the main objective of this thesis work, equipment used 
for speech data collection and explained the earlier related research. 

Figure 1.4 represents the main framework of the Isolated Word HMM recognizer 
built to test the viability of using ear-microphone data for speech recognition in this 
research. 



Figure 1.4 Main framework of the Isolated Word HMM recognizer 


Chapter II first presents a discussion of the characteristics of in-ear microphone 
data. Next, the chapter covers the pre-processing of these acoustic signals. The pre- 
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processing involves the short-term signal analysis, the endpoint detection algorithm to 
determine the onset and the termination of a spoken word in a given acoustic speech 
signal. 

Chapter III introduces the popular speech feature coefficients used to extract 
spectral information from acoustic signals. The computation of Mel-frequency Cepstral 
Coefficients (MFCC) is explained using a step-by-step approach. Then, the dynamie delta 
parameters (delta-MFCC) are derived from the MFCC parameters to supplement 
information of MFCC parameters with information on its rate of change over time. The 
rest of the chapter deals with the vector quantization (VQ) since a discrete-observation 
HMM is to be used rather than a continuous-density HMM. The K-means clustering 
algorithm is described and used to generate a codebook which maps the continuous¬ 
valued speech features (MFCC and delta-MFCC parameters) to a discrete set of 
codebook indices. 

Chapter IV discusses the fundamentals of HMM, the three basic HMM problems 
and their efficient solution algorithms. The chapter concludes with the discrete-symbol 
HMM implementation of the isolated word recognition (IWR) system to justify the 
effectiveness of the air flow signals collected within the ear canal for speech recognition. 

Chapter V provides a system overview of the isolated word HMM recognizer and 
presents HMM elassification results. 

Chapter VI presents conclusions and recommendations for future work. 
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II. IN-EAR MICROPHONE DATA DIGITAL PRE-PROCESSING 

A. INTRODUCTION 

Traditional speech recognition systems typically use a directional microphone to 
collect speech data. Most directional microphone systems are successful in low-noise 
environments. However, speech recognition performances may degrade significantly in 
high noise surroundings, and many different methods have been proposed to improve the 
robustness of speech recognition systems. Some of these include complex techniques 
using multiple microphone arrays (to work on multi-channel data), adaptive noise 
cancellation methods, combining different types of speech features or using sophisticated 
endpoint detection schemes. However, most of them are either intrusive or impractical to 
implement in real world applications. 

In an effort to guard against noisy surroundings and allow for a relatively simple 
speech recognition approach, an unconventional recording device, an earpiece 
microphone placed inside the external ear canal was used for capturing the speech signal. 
Data were collected using an 8 kHz sampling rate to ensure at least telephone-quality-like 
spectral information would be available to the recognizer. A small vocabulary consisting 
of seven isolated words was recorded in an earlier study [Newton, 2006] and included the 
following words: left, right, up, down, move, kill and pan. Twenty experimental adult 
subjects uttered each word 50 times, which led to a database of 20 x 50 x 7 = 7000 words. 

First, we present a brief overview of human auditory and speech production 
systems is presented, and discuss the main differences observed between the speech data 
collected within the ear canal and the data collected in front of the mouth. 

Fundamentally, the simplest form of communication occurs when a speaker 
releases an acoustic sound pressure wave in the air by using speech articulators such as 
vocal folds, palate, teeth, tongue, etc. For example, a single-frequency sound wave 
traveling through air from a speaker’s mouth to a listener’s ears excites a sinusoidal 
pressure variation in the direction of propagation. 

Speech processing is a multi-disciplinary field as hearing is an integral part of the 
speech chain, and a broad review of sound-wave propagation and hearing mechanism can 
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be found in the white paper, [Sibbald, 2001]. Due to the elose eonneetion between the 
unique method of speeoh-eollection and the human hearing proeess, it is first neeessary to 
highlight some basie related dynamies of the human auditory system before proeeeding to 
an engineering view of speeeh produetion. The frequeney response of a healthy human 
ear is between approximately 20 Hz to 20 kHz, with the high frequeney eapability 
diminishing with age. Consequently, the standard CD audio sampling frequeney was 
ehosen to be 44.1 kHz, whieh allows for frequeneies up to the usual highest frequeney 
pereeivable by the human ear sinee the sampling frequeney should be at least double the 
highest frequeney eomponent available. However, the human ear is most sensitive to a 
frequeney speetrum ranging from 500 Hz to 4000 Hz, whieh roughly eorresponds to the 
speeeh bandwidth earried along analog telephone lines [Marsh, 1999]. In addition to this 
inherent sensitivity, the human ear is eapable of a wide dynamie range of hearing 
intensity, whieh is praetieally measured at approximately 130 dB from the threshold of 
hearing to the threshold of pain. It should be noted that for the purposes of this study, the 
struetures of the outer and middle ear ean be eonsidered to be both a pre-amplifier and a 
limiter that eontribute to this remarkable sensitivity and the dynamie range [Sibbald, 
2001 ]. 

Although the details of the human hearing meehanism is beyond the seope of this 
thesis, it is important to note that the auditory eanal ean resonate and amplify sounds 
within a frequeney range of approximately 2000 Hz to 5500 Hz by up to a faetor of 10 
[Marsh, 1999]. Besides, further amplifieation - up to 20 times at some resonanee 
frequeneies - takes plaee in the oval window of the eoehlea. This knowledge provides a 
elue why minute pressure variations of external noise are amplified inside the ear. Henee, 
it was neeessary to isolate the in-ear mierophone from environmental noise sourees by 
plaeing it inside a foam ease during the data eolleetion proeess. 

The major eomponents of the human speeeh produetion system are the lungs, 

traehea, larynx (organ of voiee produetion), throat, oral eavity and nasal eavity; the last 

three together being referred to as the vocal tract in teehnieal diseussions. Finer aeoustie 

organs eritieal to speeeh produetion inelude voeal eords (or voeal folds), tongue, lips, 

teeth, velum and jaw, whieh are ealled articulators by speeeh seientists. Speeeh is 

eharaeterized as a diserete sequenee of sound segments ealled phones, eaeh 
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corresponding to certain positions of articulators during their production despite the fact 
that the analog speeeh signal is naturally continuous and dynamic in time and magnitude. 
Phones are acoustic manifestations of phonemes, which are considered to be linguistic 
codes that comprise a language. An interested reader can refer to [Deller, 2000] for a list 
of approximately 42 phonemes that constitute the language of Ameriean English. 

From a simple engineering point of view, speeeh production can be considered an 
acoustic filtering process in which a speech sound source excites the voeal tract filter 
[Deng, 2003]. More technically, the larynx provides a periodic vibration to the vocal 
cords, causing quasi-periodic voiced speech. Voiced sounds consist of a fundamental 
frequency and its harmonics produced by vocal cords. Likewise, the fundamental period, 
or pitch period (or only pitch), of voiced speech corresponds to the suceessive vocal cord 
closure rate [Owens, 1993]. According to Deller, an alternative definition of the term 
pitch refers to the perceived fundamental frequency of a sound, whether or not that sound 
is actually present in the waveform. Although every speaker has a habitual pitch level, 
pitch generally shifts up and down during speaking in response to various factors 
including stress, intonation or emotion. The vocal tract changes the periodic vibration 
causing formant (or resonances) and sometimes anti-formant (or anti-resonances) 
frequencies, owing to the poles and the zeros in the vocal tract transfer function, 
respectively [Witten, 1982]. Further, the vocal tract length significantly affects all vowel 
formants frequency locations. Formant locations and pitch period are commonly used 
characteristics of phonemes to extraet information from the speech signal for further 
analysis. 

Unvoiced sounds are generated by forcing air flow through a vocal tract 
constriction between the glottis and mouth. Flnlike voiced speech, there is no 
fundamental frequeney in an excitation signal with unvoiced sounds, resulting in a noise- 
like and aperiodic signal. Note that unvoiced consonants (such as the weak fricative /f/ 
and stop sound /t/ as in the word “Left” or the weak plosive burst /p/ as in the word 
“Pan”) have considerably less energy and are usually focused on higher frequencies, 
whereas the voiced vowels (such as /u/ as in the word “Flp” or /o/ as in the word “Move”) 
have most of the energy loeated in the lower portion of the spectrum as a main difference 
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[Fargues, 2005]. Table 2.1 summarizes the phonemes and the corresponding 
characteristics of the words used in this study. 


Word 

Phoneme 

Manner of articulation 

Voiced? 

Example word 

Left 

1 

liquid 

yes 

love 

e (eh) 

vowel 

yes 

bet 

f 

fricative 

no 

fluff 

t 

stop 

no 

tot 

Right 

r 

liquid 

yes 

roar 

av (ay) 

diphthong 

yes 

bite 

t 

stop 

no 

tot 

Up 

A (ah) 

vowel 

yes 

mud 

P 

stop 

no 

pop 

Down 

d 

stop 

yes 

did 

au (aw) 

diphthong 

yes 

bout 

w 

glide 

yes 

wow 

n 

nasal 

yes 

none 

Move 

m 

nasal 

yes 

maim 

u (uw) 

vowel 

yes 

boot 

V 

fricative 

yes 

valve 

Pan 

P 

stop 

no 

pop 

se (ae) 

vowel 

yes 

bat 

n 

nasal 

yes 

none 

Kill 

k 

stop 

no 

kick 

i(iy) 

vowel 

yes 

beat 

1 

liquid 

yes 

love 


Table 2.1. English Phonemes and Characteristics for Vocabulary Words of Interest. 
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Having given a brief discussion of human auditory system, speech production and 
the general characteristics of speech signal, it is now possible to focus on the unique 
properties of the speech data collected via the in-ear microphone. Previous discussions 
will demonstrate the differences between typical speech signals recorded by conventional 
directional microphones and those obtained from the in-ear microphone. 

Although most of the acoustic sound energy originates from the mouth, some 
sound waves are emitted from the nostrils, throat and cheeks as well [Deller, 2000]. All 
these pressure variations in the air constitute the analog speech waveform. This speech 
waveform is first recorded by the in-ear microphone placed inside the external ear canal 
and then converted to a digital format by an analog-to-digital converter for further 
processing. According to Deng, typical human communication spans from below 100 Hz 
up to 7 kHz due to the limitations of speech production organs [Deng, 2003]. 

In an effort to compare the two different data-collection methods. Figure 2.1 
shows time plots and spectrograms obtained for the word “right” collected with the in-ear 
microphone placed within the ear canal (top plots) and in front of the mouth (bottom 
plots). Both the time plot and spectrogram illustrate a portion of 2,000 samples in the 
signals. Note that both signals sounded very similar except for a small difference in the 
signal loudness (or amplitude). Figure 2.1 shows that signal collected from the mouth has 
higher frequency components than that collected within the ear canal. The signal 
collected from the mouth (the lower waveform) has a comparatively higher frequency 
content up to 3000 Hz, while the signal collected from the ear has most of its energy 
located in the spectrum up to 1000 Hz. Similar behavior regarding the frequency ranges 
was also observed for the other words of interest in the vocabulary. Figure 2.1 also shows 
the presence of a low frequency distortion resulting from the collection equipment. 

It was observed that signals obtained from within the ear canal had low 
amplitudes and the A/D converter gain control was adjusted accordingly before recording 
to ensure that speech signals levels were high enough for recording purposes. 
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Figure 2.1. The Word “Right” Reeorded via Ear-Microphone. 


Next, speech pre-processing issues are discussed. Spectral analysis, fdtering, 
framing and windowing for short-term analysis are covered in detail. 


B. SPEECH PRE-PROCESSING 

1. Spectral Analysis 

Spectral analysis basically involves digital filtering techniques to remove the 
noise and emphasize important frequency components of interest. Considering that the 
foam-encase of the ear-microphone decreases the amount of external noise in the ear 
canal during recording, the need for filtering mostly stems from the existence of bodily- 
created noises such as gulps, tongue clicks, lip smacks or coughs in the signal or due to 
the collection equipment. In addition, the microphone and the A/D converter used also 
introduce undesirable side effects, such as a typically 50-60 Hz humming noise, a 
fluctuating dc bias in the microphone and some other non-linear distortions due to the 
converter [Picone, 1993]. 
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Before extraeting speeeh feature information from the in-ear mierophone data, it 
is first necessary to deal with the distortions in the signal to eliminate any subsequent 
problems with the parametric spectral analysis. 

a. Spectrogram Analysis 

First, the speech signal spectrograms were analyzed to identify the 
frequency bands most affected by unwanted distortions. Figure 2.3 illustrates the 
spectrograms obtained for one trial of each word contained in the vocabulary spoken by 
the same adult male subject. 


19 



N 

X 

>% 

o 

c 

O) 

3 

w 

0> 


4000 


2000 



"Down" 


0.1 0.15 

Time (sec) 



0.1 


0.2 0.3 

Time (sec) 


0.4 



N 

X 

O 

<Z 

3 

3" 

O) 


4000 


2000 



0.2 0.3 

Time (sec) 



"Right" 



X 4000 


"Kill" 


o 


3“ 

O) 


2000 


0.1 


0.2 0.3 0.4 
Time (sec) 


0.5 


Figure 2.3. Spectrograms for the Utterances of Each Word in the Vocabulary. 

Figure 2.3 shows that recordings contain non-speech components in the frequency 
band between 50 to 100 Hz. It also shows that most of the speech signal energy for all 
words considered is roughly contained in the frequency band ranging from 100 Hz up to 
2000-2500 Hz. 
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b. Mean Correction 

The purpose of mean eorreetion is to remove any dc off-set that may be 
introduced by the microphone or the converter if any. Figure 2.4 shows the time-domain 
plot of the word, “down” with a mean value of 0.0265. Although the offset in the signal 
before mean correction is small, the effect of mean-subtraction can be seen as a shift in 
abscissa (amplitude) by close investigation. 



Figure 2.4. Mean Correction on One Trial of the Word “Down;” Top Plot: before 
Correction, Bottom Plot: after Correction. 

c. High-Pass Filtering vs. Pre-Emphasis Filtering and Band-Pass 
Filtering 

Today’s speech recognition applications require relatively high signal-to- 
noise ratio (SNR) levels (usually above 30 dB) to insure high recognition performances 
[Picone, 1993]. Non-linear distortions due to recording devices, A/D conversion and 
background noise degrade recognition performances. Such adverse effects can be partly 
removed or decreased by applying a finite impulse response (FIR) high-pass filter. 
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Usually, a one-coefficient simple digital filter, known as a pre-emphasis 
filter, is used. There are two common reasons for using this filter. First, voiced portions 
have a typical attenuation of about -12 dB per octavei with increasing frequency. Thus, 
the pre-emphasis filter serves to compensate such high frequency attenuation and 
improves the spectral feature extraction [Deng, 2003]. Second, human hearing is more 
sensitive in the frequency band above 1 kHz. The pre-emphasis filter amplifies this 
region of the spectrum and contributes in modeling the most perceptually significant 
portion of the spectrum [Picone, 1993]. A common form of the pre-emphasis filter is 
given in [Deng, 2003] as follows: 

y{n) = s{n)-As{n-Y), (2.1) 

where s(n) is the speech signal and A is typically chosen between 0.9 and 1.0, reflecting 
the degree of pre-emphasis. 

However, recall that Figure 2.3 showed that the ear canal dampened high 
frequency components as most of the words energy content was contained in the 
frequency range between 100 Hz to 2500 Hz, which is lower than that observed with the 
speech collected outside the mouth with the same microphone. Consequently, the pre¬ 
emphasis filtering did not contribute to the efficiency of the spectral feature analysis in 
the experimental results with the MATLAB simulations. Thus, an alternative filtering 
technique, high-pass filtering, was preferred to mask the low frequency ambient noise. A 
6* order infinite impulse response (HR) elliptic high-pass filter was applied. The filter 
specifications were as follows: 

• The stopband: 0-60 Hz. 

• The passband: 100 - 4000 Hz (half of the sampling frequency). 

• The passband ripple: 0.5 dB. 

Figure 2.5 represents the frequency response of the high-pass filter used. 


1 Two frequencies are an octave apart if the ratio of the higher frequency to the lower frequency is two. 


22 



Phase (degrees) Magnitude (dB) 


0 


Magnitude Response (dB) 


-10 - 
-20 - 
-30 - 
-40 - 
-50 - 
-60 - 
-70 1 

-80 li 
-90 L 


0 0.5 1 1.5 2 2.5 3 3.5 


Frequency (kHz) 


Phase Response 



Figure 2.5. Frequency Response of the HR Elliptic HPF. 


Note that the nonlinear phase response may introduce some phase 
distortion in the filtered speech signal. Flowever, verification was completed by listening 
to ensure that the filtered signal was still perfectly understandable to the human ear. 
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Figure 2.5 illustrates the high-pass filter effects on one trial of the word 
“Down.” Upper and lower plots present time and spectrogram traces before and after 
filtering, respectively. Results show that the filter has removed the low frequency trends 
observed in the original which elicit the higher frequency components more clearly in the 
spectrogram plot of the filtered signal. 
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Figure 2.5. Fligh-Pass Filtering Effect on One Trial of the Word “Down.” 


Experiments were also conducted with a 6^^ order HR elliptic band-pass 
filter for performance comparison with the high-pass filter. This band-pass filter had the 
following specifications: 

• The first stopband: 0-60 Hz. 

• The passband: 100-2000 Hz. 

• The second stopband: 2400 - 4000 Hz. 

• The stopband attenuation: 60 dB. 

• The passband ripple: 0.5 dB. 
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The high-pass filtered data resulted in, on average, 4% better recognition 
rates than the recognition rates obtained from the band-pass filtered data. 

d. Framing and Windowing for Short-term Analysis 

Speech is a dynamic and non-stationary process, as the amplitude of the 
speech waveform varies with time due to variations in the vocal tract and articulators 
[Deng, 2003]. However, speech analysis usually presumes that the statistical properties of 
the non-stationary speech process change relatively slowly over time. Although this 
assumption is not strictly valid, it makes it possible to process short-time speech frames, 
ranging typically from 10 ms to 40 ms, as a stationary process. Generally speaking, the 
use of short frame duration and overlapping frames is chosen to capture the rapid 
dynamics of the spectrum. Speech parameters are extracted on a frame-by-frame basis 
and the amount of overlap determines how quickly parameters can change from frame to 
frame [Picone, 1993]. 

Windowing means multiplication of a speech signal s{n) by a window 
w{n) to weight or favor samples by the shape and duration of the window. Coupled with 
overlapping short-term frames, successive windowing is equal to applying a sliding 
window to the long-term speech signal. The simplest window has a rectangular shape, 
weighting all samples of speech signal equally. In fact, not windowing segmented short- 
duration frames at all is equivalent to applying a rectangular window. 

Window duration determines the amount of averaging used in power or 
energy calculation. Window duration and frame duration can be adjusted as a pair. For 
instance, a frame duration of 20 ms can be coupled with a window duration of 30 ms 
[Picone, 1993]. An alternative is to choose the window duration equal to the frame 
duration for simplicity. 

As a matter of fact, the selection of proper frame duration (and indirectly 
the window duration) is eventually dependent on the change of rate of the vocal tract 
shape. Frames of 10 ms with 50% overlap are chosen and a rectangular window applied 
to each frame before calculating short-term parameters such as zero-crossing rate and 
short-term energy. Note that selecting too small a frame duration or equivalently 
increasing the frame rate would result in increased complexity and memory requirements. 


25 



2. Speech Boundary Detection 

Determining the beginning and the termination of speeeh in the presenee of 
baekground noise is a eomplieated problem [Rabiner, 1975]. 

The problem of separating speeeh from baekground silenee and noise does not 
eonstitute a minor task, exeept in oases of very high signal-to-noise ratios (on the order of 
30 dB or better). Environments with high SNR are rarely seen in real-world applioations 
of speeeh reoognition systems [Rabiner, 1975; Deller, 2000]. Further, a noise-robust 
endpoint deteotion algorithm must also be oapable of dealing with speaker-dependent 
disturbanoes like ooughs, gulps, tongue olioks, lip-smaoks, eto. [Srydal, 1995]. 

Also note that the effioienoy of aoourate endpoint deteotion has a signifioant and 
direot effeot on the performanoe of the entire reoognition system. In praotioe, the prooess 
of aoourate endpoint deteotion is not stable and many reoognition faults (or 
misolassifioations) oan be traoed baok to poor endpoint deteotion [Qiang, 1998]. 

Rabiner and Deller et al. reported the broad oategories of low energy phonemes, 
whioh usually oause endpoint deteotion failures [Rabiner, 1975; Deller, 2000]; 

• Weak frioatives ({f, th, h}, suoh as /f/ looated in the word, “half’). 

• Weak plosive bursts ({p, t, k}, suoh as /p/, /t/ and /k/ present in the words, 
“pan”, “left” and “kill” respeotively). 

• Final nasals ({m, n, ng}, suoh as /n/ present in the words, “pan” and 
“down”), and 

• Trailing vowels at the end (suoh as /u/ in “zoo”). 

Ordinarily, zero-orossing rate and short-term energy oan be oombined to form 
appropriate spatial or temporal features for determining the onset and termination of 
speeeh boundaries [Qiang, 1998]. Note that these temporal features oan be extraoted 
simply from the sample values of speeeh signal without transforming the signal into the 
frequenoy domain. 

Rabiner et al. proposed a fairly simple and reliable algorithm for deteoting the 
beginning and the termination of speeeh utteranoes in low noise environments (i.e., in a 
SNR level of at least 30 dB), based on short-time energy and a zero-orossing rate of 
speeeh [Rabiner, 1975]. This work mainly fooused on the implementation of said 
algorithm in MATFAB. 


26 



a. Short-term Energy Measure (STE Measure) 

Short-term energy is a eommon parameter that has been used sinee the 
1970s, espeeially to distinguish between voieed sounds and unvoieed sounds or silence 
[Qiang, 1998]. Qiang et al. compared the performances of the following three short-time 
energy measurements in endpoint detection and concluded that the absolute short-term 
energy is the most effective energy parameter for this task. 

N 

• Logarithmic short-term energy: = ^log[5(n)^], (2.2) 

n=\ 

N 

• Squared short-term energy: E^^^ = (2.3) 

n=l 

N 

• Absolute short-term energy: = EN")!- (2-4) 

n=l 

Consequently, the absolute short-term energy was selected as the short- 
time energy parameter in this endpoint detection algorithm due to its simple 
implementation and efficiency. The speech signal was first divided into 50% overlapping 
frames of 10 ms and then passed through a rectangular window. The short-term absolute 
energy was computed by summing the absolute magnitudes of speech samples in each 
frame as shown in Eq. (2.4) above. Therefore, the short-time absolute energy can be 
computed as follows: 

m 

E^= ^ \s(n)w(m-n)\, (2.5) 

where w(m) refers to the rectangular window with the A-length frame duration ending at 
n = m. The use of an absolute magnitude function rather than square or logarithm 
functions simplifies the computations, thereby decreasing the computational load in 
implementation. 

b. Short-term Zero-crossing Rate (STZC Rate) 

The number of zero-crossings, which is also a useful temporal feature in 
speech analysis, refers to the number of times speech samples change sign in a given 
frame. The rate at which zero-crossings occur is a simple measure of the frequency 
content of a narrowband signal [Rabiner, 1978]. The short-term zero-crossing measure, as 
formally defined in [Deller, 2000] for the A-length frame interval ending at n = m , is 
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( 2 . 6 ) 


1 m 

zcR=— y 

M ^ 

^ ’ n=m-N+l 


sgn[5(n)]-sgn[5(n-l)] 


w{m - n). 


The multiplication by a factor of — is intended to take the average value 


of the zero-crossing measure. Therefore, the factor of — can be dropped off for 
simplification in computation. 


The zero-crossing rate is a useful parameter for estimating whether speech 
is voiced or unvoiced [Deng, 2003]. Voiced speech has most of its energy collected in the 
lower frequencies, whereas most energy of the unvoiced speech is found in the higher 
frequencies [Rabiner, 1978]. Since speech is wideband, the zero-crossing rate 
corresponds to the average frequency of the primary energy concentration in the signal. 
Since high frequencies imply high zero-crossing rates and low frequencies imply low 
zero-crossing rates, high and low zero-crossing rate correspond to unvoiced and voiced 
speech, respectively. 

c. Endpoint Detection Algorithm from STE and STZC Measures 
A popular method for endpoint detection of speech boundaries was first 
published by Rabiner and Sambur [Rabiner, 1975]. The short-term energy and zero¬ 
crossing rate of speech have been extensively used to detect the endpoints of an utterance 
since then. 


This algorithm was adopted and modified to better locate the beginning 
and the termination of speech from the ear-microphone data. This is a two-step search 
algorithm where the short-time energy for a coarse search is first used. Then, the zero¬ 
crossing measure fine-tunes the coarse boundaries expanding forward and backward. The 
zero-crossing measure applied in the second search helps detect low-energy phonemes at 
the beginning or end of the word, especially when dealing with weak fricatives (/f/, /th/, 
/h/), plosive bursts (/p/, /t/, /k/) or final nasals (/m/, /n/, /ng/). Figure 2.6 shows short-term 
energy and zero-crossing measures for a typical utterance of the word “left” from ear- 
microphone data. The mean and the standard deviation of the short-time energy and zero¬ 
crossing measures are first computed during the first 50 ms of recording, assuming there 
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is only background noise in that interval. An upper and lower threshold (shown as and 
T, on the upper plot of Figure 2.6.) for the short-term energy and another threshold 
(shown as on the lower plot of Figure 2.6) for the zero-crossing measure are set based 
on these statistics and experimental findings as follows: 


T, =SxMINSTE, 

(2.7) 

=32x MINSTE, 

(2.8) 

MINSTE = min(/E, mean(5rE) + std(5rE)) 

(2.9) 

IE = 0.25 

(2.10) 

= min(/F, mean(ZCi?) + std(ZCi?)), 

(2.11) 

IE = 0.25xN. 

(2.12) 


where N is the frame length; STE and ZCR stand for short-time energy and zero¬ 
crossing rate, respectively. 


Short-term Energy for the word "Left" 
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Figure 2.6. Typical Example of the Use of Short-Term Energy and Zero-Crossing 

Rate in Endpoint Detection. 
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The short-time energy curve is then searched to find the first and second 
successive crossings of the upper threshold . Then, it is necessary to move backward 
and forward from these first and second crossings, respectively, to seek the coarse 
endpoints (shown as and N 2 ), where the lower threshold 7] is first exceeded. This 

initial search yields the tentative endpoints, andT/j shown as dotted lines in Figure 
2.6. Next, the fine search on the zero-crossing plot is performed, moving toward the ends 
from Ny andNj for no more than 10 frames, examining the zero-crossing rate to find 

three occurrences of counts above the threshold . Last, the final endpoint estimate is 
moved backward and/or forward to the time of the first threshold crossing if three such 
occurrences are found. This is the case for andN 4 shown as solid lines in the lower 

portion of Figure 2.6. The endpoints remain at the initial estimate Ny and N 2 when three 

such occurrences over the zero-crossing threshold are not found. The MATLAB code for 
this detection algorithm is given in Appendix A. 1. 

Note that incorrect detection of speech boundaries in isolated word 
utterances results in potentially two negative effects: increased computations and 
misclassified recognitions [Ying, 1993]. In addition to the method of short-term energy 
and zero-crossing rate, experiments were conducted with the modified Teager’s energy 
for comparison. 

d. Endpoint Detection from Modified Teager’s Energy 

Assume a signal sample is shown as x,. = Acos(Q;+^/i),where A is the 

amplitude, Q is the digital frequency and (j) is the initial phase of the signal. In Teager’s 
energy algorithm, the instantaneous energy Ey of the sample Xy is computed by [Ying, 
1993]: 

Ei=X. -Xy^yXy_y 

= A"sin"(Q) (2.13) 

Equation (2.13) shows that both the square of signal amplitude and the 
square of corresponding frequency component are taken into account in Teager’s energy 
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computation. Ying et al. applied the frame-based Teager’s energy measure to the 
endpoint deteetion problem and developed the following algorithm [Ying, 1993]; 

1. Compute the power speetrum for each frame of speech data, 

2. Weight each sample in the power spectrum with the square of the 
frequeney, 

3. Take the square root of the sum of the weighted power spectrum. 

The resulting summation in the last step above produces the energy of one speech frame 
and this measure is used to determine the endpoints of an utterance. 

In this study’s experiments, the speech data is split into frames of 20 ms 
with 50% overlap. The Frame-based Teager Energy (FTE) was computed as stated above. 
Silence energy is determined from the first 10 frames of speech. The same energy 
thresholds T^ and from the previous detection algorithm (Rabiner and Sambur 

algorithm) are calculated using Eqs. (2.7) through (2.10) and a search scheme applied to 
determine where the FTE measure first exceeds the upper threshold. Finally, endpoint 
loeations are refined by searehing along the FTE eurve for the loeation where the lower 
threshold is first passed down. This method does not inelude the short-term zero-erossing 
measure in the search scheme. The final algorithm implemented in MATFAB is ineluded 
in Appendix A.2. 

In both deteetion algorithms, the assumption was that only noise is 
deteeted when the segment identified by this search scheme is less than 100 ms, as the 
words eonsidered in this study all have a longer duration. Thus, sueh deteetions are 
assumed to be false alarms and dismissed during the seareh. 

Experiments conducted on the same trials with these two detection 
algorithms produced eomparable results for loeating the speeeh boundaries in the given 
utteranees. As a result, the STE and ZCR detection algorithm first deseribed in Seetion 
(II.B.2.e) was used for the eropping task as it was simpler. Onee the words are eropped 
the next step is to extraet the speeeh features relevant to the data speetral and temporal 
characteristics. Speech feature extraction and vector quantization steps are discussed in 
the next chapter. 
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III. SPEECH EEATURE EXTRACTION AND VECTOR 

QUANTIZATION 

A. INTRODUCTION 

Linear Predictive Coding (LPC) coefficients, Reflection coefficients (RC), 
Cepstral coefficients, LPC-derived Cepstral coefficients (LPC-CC), Mel-frequency 
Cepstral coefficients (MFCC) and their variations are commonly used as speech feature 
parameters. 

Linear Predictive Coding (LPC) has been popular for speech compression, 
synthesis and as well as recognition since its introduction in the 1960s because it offers a 
reasonable engineering approach for speech signal analysis. Linear prediction models the 
human vocal tract as an infinite impulse response (HR) system that produces the speech 
signal. In other words, LPC makes use of an autoregressive (pole only) model to estimate 
a short-term spectral envelope for the speech spectrum with an emphasis on the peak 
spectral values that characterize voiced sounds. The corresponding pole-only transfer 
function of the spectral model has the following form [Gold, 2000]; 

H{z) = --. (3.1) 

k=l 

The linear prediction problem can be stated as finding the coefficients , which 
minimizes the mean-squared prediction error of the speech signal s{n) in terms of the 
weighted sum of its past samples [Rabiner, 1993]. 

For vowel sounds and other voiced portions of speech, which have resonant 
frequencies and high degree of similarity over short time intervals at their pitch period, 
linear predictive modeling produces an efficient representation of speech. However, 
resonant frequencies vary among people. Hence, LPC-based speech features tend to 
include speaker-dependency into modeling, which degrades the performance particularly 
for a speaker-independent recognition system. Furthermore, the all-pole constraint of the 
linear prediction model can cause poor modeling of spectral nulls in noisy environments 
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[Deller, 2000]. Hence, LPC parameters are not as popular today in speech recognition 
although they are still widely used for compression [Picone, 1993]. 

Unlike LPC analysis, Cepstral Analysis provides more robust speech features in 
that they are less dependent on speaker-dependent characteristics and improves 
recognition in noisy environments. In addition, some weaknesses of LPC parameters can 
be smoothed by using Cepstral parameters derived from LPC analysis [Davis, 1980; 
Picone, 1993]. These two reasons mainly explain why cepstrum coefficients started to 
replace the direct use of LPC coefficients as the premium speech features in Hidden 
Markov Model (HMM) speech recognizers in the 1980s [Deller, 2000]. The cepstral 
analysis approach is discussed next. 

B. CEPSTRAL ANALYSIS 

Among all popular speech parameters, the most functional and efficient ones 
extract spectral information (in the frequency domain) from speech, because a more 
concise and easier analysis of speech can be performed spectrally rather than temporarily 
(in the time domain) [Vergin, 1999]. Although speech signals demonstrate a range of 
inter-speaker variations for the same utterance in the time domain, this utterance still 
exhibits consistency in the frequency domain, to some extent. For this reason, spectral 
analysis is preferred over temporal analysis to discriminate between phonemes and 
extract speaker-independent features from speech signal. 

Cepstral analysis is a special case of homomorphic signal processing. A 
homomorphic system is defined as a nonlinear system whose output is a linear 
superposition of the input signals under a nonlinear transformation. Cepstral analysis has 
become popular in speech recognition since its discovery in the late 1960s, due to the 
powerful yet simple engineering model of human speech-production behind it [Rabiner, 
1978; Picone, 1993]. According to this linear acoustic model, a speech signal is produced 
by filtering an excitation waveform through the vocal tract filter as depicted in Figure 3.1. 
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S^)eech 
signal, s(n) 

Figure 3.1. Linear Aeoustie Model of Human Speeeh-Produetion (After [Picone, 

1993]). 

In this model, the speeeh signal is expressed as the convolution of an excitation 
signal e(n) with the vocal tract response h(n). The excitation sequence is either a quasi- 

periodic vocal cord pulse in the case of producing voiced speech or just random noise at 
the vocal tract constriction, which generates unvoiced speech [Deng, 2003]. 
Homomorphic signal processing offers a fairly simple method, known as cepstral 
deconvolution, to decouple the vocal tract response from the excitation response, thereby 
enabling it to model the vocal tract characteristics better. The decomposition of a speech 
signal s{n) into the excitation sequence e{n) and the vocal tract function h{n) can be 
described as follows [Picone, 1993]: 

s{n) = e{n)®h{n), (3.2) 

where the operator, “ ® “ represents the convolution operation. Recall that the 
convolution operation in time corresponds to a multiplication in the frequency domain. 
Thus, Eq. (3.2) becomes 

= (3.3) 

Note that the complex speech spectrum S{f) is composed of a quickly varying 
part, excitation spectrum E(f) (which corresponds to high frequency components) and a 
slowly-varying part, vocal tract response H{f) (which corresponds to low frequency 
components). Considering that the speech signal is real-valued, the logarithm of Eq. (3.3) 
on both sides leads to 

log(|5(/)|) = log(|E(/) • H(f)\) = log(|E(/)|) + log(|//(/)|). (3.4) 

Now that the signal components in Eq. (3.4) are linearly combined, a linear filter 
(also known as liftering operation in speech engineering terminology) can be applied to 
remove the noise-like, quickly-varying excitation part from the speech spectrum. Then, 


Exdtaticm 
Wavefonn e(n) 


Vocal Tract Filter 
H(f) 
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the inverse Fourier transform is applied to the remaining component to compute the real 
cepstrum. In short, under a cepstral transformation, the non-linear convolution of two 
signals e{n) 0 h{n) becomes equivalent to the linear sum of the cepstral representations 

of the signals, C^(s) + Ci^(s). As a result, the real cepstrum is the inverse Fourier 
transform of the logarithm of the power spectrum of a speech signal [Deng, 2003]: 

= for n = 0,l,...,A-l. (3.5) 


Figure 3.2 shows a block diagram representation of the short-term real cepstrum 
computation. 


Short-time 


Frames 


s{n) 



Real Cepstrum 
c^{n;m) 


Window 

Figure 3.2. Real Cepstrum Computation (After [Deller, 2000].). 


Figure 3.3 plots the real cepstrum computed for the voiced phoneme, /ae/ in the 
word “pan.” 
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Real Cepstrum for the voiced phoneme, /ae/ in the word, "pan" 



Figure 3.3. Real Cepstrum Computed for the Voiced Phoneme /ae/ in the Word “pan.” 

Figure 3.3 illustrates the fact that the dominant contents of the cepstra are located 
near the origin, and that a small number of cepstrum coefficients can be used to provide 
enough spectral information about the phoneme /ae/. In addition, pitch period information 
may be extracted by the spacing between successive cepstral peak locations (in this case, 
the pitch period is about 75 samples or equivalently 9.375 ms). 

Figure 3.4 plots the first 20 coefficients of the real cepstrum computed above, 
which shows that the first 8-10 coefficients carry most of the spectral information about 
the voiced phoneme /ae/. 
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First 20 coefficients of the real cepstrum computed for the phoneme, /ae/ 
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Figure 3.4. The First 20 Coefficients of the Real Cepstrum for the Phoneme /ae/. 

1. Mel-Frequency Cepstral Coefficient (MFCC) Computation 

In a broad sense, feature extraction aims for data reduction by converting the 
input signal into a compact set of parameters while preserving spectral and/or temporal 
characteristics of the speech signal information. Cepstral analysis has been extensively 
used for feature extraction in speech recognition. Perhaps, the most popular derivation of 
cepstral analysis combines the cepstrum with a nonlinear frequency-warping, known as 
mel-scale conversion [Deller, 2000]. The resulting coefficients are called Mel-frequency 
Cepstral Coefficients (MFCC) and these coefficients were used in this study. 

The motive behind the derivation of MFCC’s lies in the efforts to mimic human 
ear operation, and more specifically, the critical bands [Davis, 1980]. First, the spectral 
resolution of the human auditory system is not linear along the frequency axis [Deng, 
2003]. In other words, human sound perception is nonlinear. For example, humans can 
easily discriminate between closely spaced low frequency tones such as 300 and 350 Hz 
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but not between elosely spaeed high frequeney tones sueh as 3000 and 3050 Hz. Seeond, 
the human ear has some eritieal bands, meaning that only the stronger of two elosely- 
spaced frequeney tones falling into the same eritieal band will be pereeived by the ear. 
The eritieal bands of the human ear were experimentally determined by observing the 
frequeney band in whieh the pereeption of subjeets ehange abruptly as the frequeney of a 
sound is varied [Deller, 2000]. As a result, the mel-seale eonversion is expressed as linear 
frequeney spaeing below 1 kHz and a logarithmie spaeing above 1 kHz to mimie the 
speetral eharaeteristies of the human ear. Simply, it maps an aeoustie frequeney to a 
pereeptual frequeney seale as follows [O’Shaughnessy, 1987; Pieone, 1993]; 

=2595-log ,0 1 + ^;^ • (3.6) 

V /PP J 

In analogy, the operation of the inner ear is often viewed as a bank of bandpass 
filters spaeed linearly at low frequeneies and logarithmieally at high frequeneies. Davis et 
al. introdueed the mel-seale eritieal-band filtering with a simple set of 20 triangular 
bandpass filters [Davis, 1980]. Davis eomputed MFCC parameters for monosyllabie 
word reeognition as follows: 

20 [ f 1 ^ ; 7 - ~ 

MFCC =y X, COS n k -, for n = 0,1,2,...,M, (3.7) 

" tr LI 2j2oj 

where M is the number of MFCC eoeffieients and X^, k = \, 2,..., 20, represents the log- 
energy output of the k'^ fdter. 

The ehoiee of parametrie representations may signifieantly affeet reeognition 
performanees in isolated word reeognition applieations. Note that several feature 
representations exist for parametrieally representing the speeeh signal in speeeh 
reeognition. Davis et al. tested and eompared the performanees of mel-frequeney 
eepstrum eoeffieients (MFCC), linear frequeney eepstrum eoeffieients (LFCC), linear 
predietion eoeffieients (LPC), linear predietion eepstrum eoeffieients (LPC-CC), and 
refleetion eoeffieients (RC) for monosyllabie word reeognition with a template-based, 
dynamie time warping reeognizer [Davis, 1980]. Their primary eonelusion was that the 
MFCC parameters outperform other parameter types in speeeh reeognition applieations 
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with a compact set of coefficients capturing the information relevant for recognition. The 
basic conclusion of this paper is still considered valid today although there have been 
many improvements in speech signal representations since the early 1980s 
[Karnjanadecha, 2001]. 

According to Davis, MFCCs allow for better suppression of insignificant spectral 
variation in the higher frequency bands and form a particularly compact representation. 
Davis also reported that the Euclidean distance metric defined on the cepstrum 
parameters apparently allows a better separation of phonetically distinct spectra. 

Motivated by Davis et al. work, mel-frequency cepstral coefficients and their 
dynamic derivatives were used for feature extraction from the ear-microphone data 
during this study. Figure 3.5 shows the block diagram for the MFCC feature extraction 
step by step. The key difference between MFCC’s and the cepstral coefficients lies in the 
use of a mel-scale filter bank. This study’s MATLAB implementation for MFCC feature 
extraction is a modification of that available in the Voice Toolbox [Brookes, 1997] and is 
ineluded in Appendix B. 1. 



Figure 3.5. Block Diagram for the MFCC Feature Extraction. 

The details of the block diagram are described below. 

• Framing: First, the cropped speech data is blocked into frames of 
N = 256 samples (corresponding to 32 ms) with 40% overlapping to 
better capture temporal changes from frame to frame. 

• Windowing: Next, each frame is multiplied by a window function for 
short-time analysis. Applying a window which tapers down at the ends 
minimizes the spectral distortion due to discontinuities or abrupt changes 
at the window endpoints. Typically, a Hamming window is a good choice 
in speech analysis. Thus, a Hamming window of the form was selected; 
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w{n) = 


0.54-0.46cos 

0 


Inn 

N-\ 


n = OX-,N-\. 
otherwise. 


(3.8) 


The window length is ehosen as a trade-off between frequeney resolution 
and temporal resolution [Beechetti, 1999]. Recall that a short duration 
window allows for the detection of the amplitude decay of formants with 
better temporal resolution. However, using a longer window allows for 
better spectral resolution. Although a longer window is useful for 
capturing fine spectral variations, it is also in contrast to the requirement 
for the quasi-stationary analysis of signal segments. As a result, a window 
length equal to the frame length {N = 256) was selected. 


Fast Fourier Transform (FFT): the Fast Fourier Transform is a fast 
implementation of the Discrete Fourier Transform (DFT) which converts 
A-samples of frames into frequency spectrum. 

Mel-frequency Warping: The mel-scale filter bank implementation used in 
this study includes 24 triangular filters non-uniformly spaced along the 
frequency axis, as shown in Figure 3.6. 

Log Energy Computation: This step applies a logarithm transformation to 
the absolute magnitude of the coefficients obtained after mel-scale 
conversion. The absolute magnitude operation discards the phase 
information while the logarithm operation performs a dynamic 
compression, making feature extraction less sensitive to speaker- 
dependent variations. [Beechetti, 1999]. 


Mel-frequency Cepstrum Computation: Mel-frequency cepstral 
coefficients are computed by applying the inverse FFT to the logarithm of 
the magnitude of the filter bank outputs. Note that the inverse DFT 
reduces to a Discrete Cosine Transform (DCT) operation as the log 
magnitude spectral of the coefficients are real and symmetric [Beechetti, 
1999]. Moreover, the DCT has the advantage of producing highly 
uncorrelated features [Jayant, 1984; Deng 2003]. The resulting mel- 
frequency cepstral coefficients are computed as follows: 


iV 

c{k) = w(k)^log(|x,(n)|)cos 


n=\ 


n{2n -V){k -V) 
2N 


k = \,...,L, (3.9) 


where 



, k=\. 

, 2<k<L. 


N = 256 (window length). 


(3.10) 


(3.11) 
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Amplitude 


L = 24 (Number of filters in the mel-seale filter bank), 


(3.12) 


and (n) represents the output of the k"' filter in the filter band. 


Mel-scale Filter Bank 



Figure 3.6. Mel-seale Filter Band 

The zero-order MFCC coefficient represents the average log energy of 
the frame. This coefficient is usually discarded from the feature space 
because absolute power measures are not reliable in speech recognition 
[Picone, 1993; Becchetti, 1999]. Experimental results demonstrate that 12 
coefficients were sufficient to represent the speech data under 
investigation. Therefore, the zero-order coefficient was excluded and the 
remaining first 12 MFCC coefficients kept for a compact representation. 

• Cepstral Mean Correction (CMC): It is inevitable that a speech signal is 
subject to some spectral distortions during recording and pre-processing 
prior to recognition due to microphone offsets, environmental effects (like 
reverberation) and/or transmission chaimel [Becchetti, 1999]. The cepstral 
mean correction was applied to compensate for such distortion by 
subtracting the cepstral mean of a frame from the cepstral coefficients for 
additional robustness in recognition. 

• Delta-MFCC Computation: Feature vectors consisting of only MFCC 
parameters appear to provide smooth estimates of the local spectrum. 
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However, these features do not contain information regarding the speech 
signal dynamic evolution, which also carries relevant information in 
speech recognition [Becchetti, 1999]. Therefore, improvements in 
recognition performance can be obtained by taking into account the 
dynamic characteristics of the MFCC features [Deng, 2003; Furui, 1992], 
The simplest approach to obtain these dynamic features takes the basic 
difference of coefficients between consecutive frames. The resulting 
coefficients, known as dynamic or delta parameters, reflect cepstral 
changes over time. Furui et al. reported significant improvements in 
recognition due to the addition of the differential features to the 
instantaneous coefficients [Furui, 1986]. However, researchers have also 
argued that the basic differencing operation is too sensitive to random 
inter-frame variations and should be replaced by a smoother estimate of 
the local time-derivative [Furui, 1986; Deller, 2000; Deng, 2003]. 
Consequently, most speech recognition systems today incorporate delta 
features as an add-on to the static parameters such as MFCCs [Gold, 
2000]. For these reasons, 12 delta-MFCC parameters were included in this 
study’s set of speech features in addition to the regular MFCC parameters. 
The delta-MFCC coefficients are computed using the following regression 
formula: 


M 

- M -^ ( 3 - 13 ) 

a=\ 

where the configuration parameter for the difference length is M = 4 and 
d^. represents the delta coefficient computed in terms of the corresponding 

static coefficients to . 

The regression formula shown in Eq. (3.13) is equivalent to passing the 
static MFCC coefficients through a linear differential filter with impulse 
and magnitude response are shown in Figure 3.7. One potential problem 
with the use of delta MFCC parameters is that they are not in the same 
range of amplitude with the static MFCC parameters, which results in poor 
vector quantization. Through experiments, it was found that the delta 
parameters approach the range of static parameters when they are 
weighted by a factor of six. 
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Figure 3.8 shows a typical example of the resulting 12 MFCC and 12 weighted 
delta-MFCC parameters extracted from the ear-microphone data for the spoken word 
“pan.” Figure 3.9 plots the cepstrogram for the same 24 parameters. 
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12 MFCC plus 12 delta-MFCC parameters for the word "Pan" 



Figure 3.8. MFCC and Delta MFCC Parameters Extraeted from the Ear-Mierophone 

Data of the Spoken Word “Pan.” 


12 MFCC plus 12 delta-MFCC parameters for the word "Pan" 



0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 

Time (s) 

Eigure 3.9. Cepstrogram Plot of the MECC and Delta MECC Parameters Extraeted 

from the Word “Pan.” 
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2. LPC-derived Cepstral Coefficients 

Recall that LPC model can be used to compute a smoothed version of cepstral 
coefficients, known as LPC-derived Cepstral Coefficients (LPC-CC). In addition to 
MFCC parameters, LPC-CC parameters were tested for performance comparison in 
recognition. First, LPC coefficients were computed by using the auto-correlation method 
and Levinson-Durbin recursion. The use of the auto-correlation method guarantees that 
the LPC model is inherently stable [Picone, 1993]. Second, the LPC parameters are 
converted to 12 cepstral parameters by using the following relation [Deller, 2000]; 


y(n;m) = a(n;m) + ^ — y(k;m)a(n-k;m), n = \,2,..M, 


(3.14) 


where a{^n;m^ represents the LP coefficients and M is the order of the LPC model. 
Next, these cepstral coefficients were weighted by a window function (the raised sine 
lifter) (n), which tapers at both ends. Equations (3.15) and (3.16) show this window 

function (n) and the resulting weighted cepstral coefficients (n;m ), respectively: 


w^{n) = \ + 


M . 

— sm 
2 



(3.15) 


y^[n;m) = y[n;m)w^{n). (3.16) 

Finally, 12 delta LPC-CC parameters were computed by using the following 
difference function; 


= + n = (3.17) 

and included in the feature vector as done previously with the MFCC parameters. The 
MATLAB implementation used for LPC-CC feature extraction was that available at 
[Sorensen, 2005] and is included in Appendix B.2. Experimental results showed that the 
set of MECC parameters outperformed the EPC-CC parameters in isolated word 
recognition by 7%, on average. 

The next section discusses vector quantization and codebook generation. 
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C. VECTOR QUANTIZATION AND CODEBOOK GENERATION 

Pattern recognition provides the teehnieal foundation that underlies veetor 
quantization teehniques in speeeh reeognition [Gold, 2000]. The goal of pattern 
reeognition is to elassify objeets of interest into a number of elasses (or elusters). In this 
ease, the objects of interest are sequences of feature vectors extracted from the ear- 
microphone data using the techniques described in the previous section. 

A supervised pattern recognition system is trained with labeled samples; that is, 
each input pattern has a class label associated with it. Pattern recognition can also be 
performed in an unsupervised fashion when information about the class membership of 
the training vectors is not provided. In the case of speech recognition, the problem of 
automatically separating training data into classes is often solved by a clustering 
algorithm in an unsupervised fashion [Deller, 2000]. For example, vector quantization is 
such a technique in which speech features vectors are clustered around some centroid 
locations. The resulting cluster centers constitute a codebook. Then, the codebook can be 
used to index a new feature vector by finding the cluster center (centroid) that is closest 
in some norm to the new vector. 

I. Vector Quantization (VQ) 

Quantization converts a continuous-amplitude signal to one of a set of discrete- 
amplitude signals, which is different from the original signal by the quantization error. 
The independent quantization of each signal parameter separately is termed scalar 
quantization, while the joint quantization of a vector is termed vector quantization (VQ) 
[Makhoul, 1985]. Vector quantization can be thought of as a process of redundancy 
removal that makes the effective use of nonlinear dependency and dimensionality by 
compression of speech spectral parameters. Generally, the use of vector quantization 
results in a lower distortion than the use of scalar quantization at the same rate [Sayood, 
2000 ]. 

Vector quantization has been significantly popular in speech coding and 
recognition after the introduction of LPC [Makhoul, 1985]. The VQ problem can be 

summarized simply in the following manner. Suppose that v = is an 

L-dimensional column vector whose components |v^'\1</<l| are real-valued. 
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continuous-amplitude random variables. In veetor quantization, the veetor x is mapped 
to another real-valued, diserete-amplitude, L-dimensional eolumn veetor, j . Thus, the 
quantization funetion is of the form 

y = Q(x). (3.18) 

Typieally, y takes on one of K distinct values sueh that F = {y;.M<7< 

where y^. represents a veetor of the form y^. = [y*'\ y^.^\..., y*^'^ . The set Y is referred 
to as the codebook, and eaeh y. is a code vector. To generate sueh a eodebook, the 
L-dimensional spaee of the speeeh feature veetor x is partitioned into K mutually 
exclusive regions or elusters such that |C^,1 ^ j ^ and a code veetor y^. is assoeiated 
with eaeh cluster Cj. The vector quantizer then assigns the eode veetor y^. to the feature 
veetor x. if the feature veetor x,. is in the cluster Cj , i.e., 

Q{x) = yj, ifxeC.. (3.19) 

For L = l, veetor quantization reduees to scalar quantization. Figure 3.10 
illustrates a typieal example of uniform elustering of two-dimensional spaee (L = 2) for 

veetor quantization. This illustration is a uniform elustering in that the two-dimensional 
spaee is partitioned into four elusters equally. Unfortunately, speeeh data are not 
generally this well-behaved in terms of uniform elustering and low-dimensionality 
[Gold, 2000]. 
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Figure 3.10. A Typical Example of Uniform Clustering of Two-Dimensional Space for 

Vector Quantization. 

2. Distortion Measure 

When X is quantized as , a quantization error results and a distortion measure 
(or a distance measure) d(x,yj) can be defined between feature vector x and the code 
vector (or codeword) yj. 

The squared Euclidian distance is one of the most popular distortion measures in 
speech recognition applications. The quantized code vector is selected to be the closest in 
Euclidean distance from the input speech feature vector. The Euclidean distance is 
defined by; 

( 3 . 20 ) 
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where is the i"' eomponent of the input speech feature vector, and is the 
component of the codeword yj. A vector quantizer is said to be optimal if it meets the 
following two necessary conditions. 

• Nearest Neighbor Condition: The optimal quantizer is realized by using a 
minimum distortion or nearest neighbor condition such that 

Q{x) = y., iff J(x, y^.) < y^,) for 7 A: and V/c = l,2,...,.fir. (3.21) 

Eq. (3.21) states that the vector quantizer selects the code vector that 
results in the minimum distortion with respect to x. Thus, the cluster Cj 

should consist of all feature vectors that are closer to y ■ than any of the 
other code vectors such that: 

C. =|x:||x-yj^ <||x-yj^ \/k = (3.22) 




Centroid Condition: Each vector yj is chosen to minimize the average 
distortion in cluster C^.. Such a vector is called the centroid of the cluster 
Cj, which is equivalent to the statistical mean of the cluster members. 
Thus, the centroid of the cluster C- is given by: 


y. = centroid 


for 7 = 1,2, 

Xm^C j 


K. 


(3.23) 


A popular method for codebook design is an iterative clustering algorithm known 
as the K-means algorithm. This algorithm divides the set of training feature vectors into 
K clusters Cj in such a way that the two necessary conditions for optimality are satisfied. 

The .fiT-means algorithm is described next. 

3. ilT-Means Algorithm 

The .fiT-means algorithm is widely used in speech processing as a dynamic 
clustering approach [Deller, 2000]. is pre-selected and simply refers to the number of 
desired clusters. Einde, Buzo and Gray et al. were first to suggest the use of the .^T-means 
algorithm for vector quantization with a large class of distortion measures [Einde, 1980]. 
Consequently, the .^T-means algorithm has also been named after Einde, Buzo and Gray as 
the generalized LBG algorithm in speech processing literature. 
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The LBG algorithm was chosen to execute the vector quantization step for this 
research. The objective of the LBG algorithm is to define the regions C- and their 

representatives so that the overall distortion is minimized. One way to compute the 

code vectors of the training set is to start with an arbitrary random initial estimate of the 
code vectors and to apply the nearest neighbor condition and the centroid condition 
iteratively, until a termination criterion is satisfied [Theodoridis, 2003]. 

Thus, a set of K code vectors is discovered into which all speech feature vectors in 
the training set can be quantized with minimum distortion. This set of code vectors 
represents a codebook for the feature space. Figure 3.11 shows the flow diagram for the 
^-means clustering using the generalized LBG algorithm. 



Figure 3.11. Flow Diagram for the .fiT-means LBG Clustering Algorithm. 

The operation of the generalized LBG algorithm is summarized as follows. 
[Deller, 2000]. 
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• Initialization. Choose an arbitrary (random) set of K code vectors, say 
y^, j = l,2,...,K, where represents the initial centroid (or initial mean 

value) of the feature vectors in cluster Cj. 

• Iteration. 

1. Apply the nearest neighbor rule for each feature vector x in the 
training set to quantize v into code vector y. , where 

j = argmin[j(;c,y,.)], i = \,2,...K. (3.24) 

i 

Here J(-,-) represents the squared Euclidean distance in the feature space. 

2. Compute the total distortion that has occurred as a result of this 
quantization in terms of the squared Euclidean distance, 

D = Y,d[x,Q{x)\, (3.25) 

where the sum is taken over all vectors x in the training set and Q{x) 
indicates the code vector y^. to which x is assigned in the current 
iteration. 

3. Apply the centroid condition for updating the codebook. Eor each 
cluster, compute the centroid of all vectors .r such that 

yj =centroid{Cj^ = Q[x) during the current iteration. Update the 

codebook based on the new set of centroids and return to Step 1. 

• Termination: Iterations in step 1 through 3 repeat until a specific 
termination criterion is met. Termination criterion might be determined in 
various ways, such as reaching a maximum iteration number or having a 
sufficiently small squared Euclidean distance. Specifically, the goal is to 
terminate the iteration when the difference between the updated code 
vectors and the previous code vectors is sufficiently small. 

Eike many other numerical minimization algorithms, the .fiT-means algorithm can 

converge to a local optimum instead of a global optimum based on the starting point 

(randomly-selected initial codebook). The distortion measure is likely to converge 

towards the local minimum closest to the initial codebook. Eor this reason, the algorithm 

should be executed several times with different initial codebooks to achieve some type 

global optimality. Then, it is possible to choose the optimum codebook, which results in a 

minimum overall distortion. 
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4. Codebook Generation from Speech Features 

Since the diserete observation Hidden Markov Models (HMM) classifier will be 
used to reeognize the words of interest, this study is restricted to the production of 
discrete observation symbols as input to the classifier. In other words, the naturally 
oecurring continuous-valued observation vectors should be quantized into indices of a 
codebook before elassifieation. 

Next, the utilization of vector quantization in this study is detailed. A training 
matrix of size (F by r) is formed by concatenating the first 12 MFCC and 12 delta- 

MFCC parameters extraeted from all the training utterances. Each column of this training 
matrix represents F speeeh features of a given frame and T is the number of all frames. 
Given the size of the eodebook K, this randomly-seleeted training matrix is used to 
generate the eodebook representing the aeoustic information of all the words of interest. 
To train the eodebook properly, the training feature vectors should span the anticipated 
range of acoustic information used for reeognition. Two thirds of the overall data is 
randomly ehosen to ensure that the training data for eodebook generation is sufficiently 
large. In addition, it is also neeessary to choose K random code vectors as initial 
conditions. 

Each feature veetor in the training matrix is then clustered based on the squared 
Euelidean distance between a training vector and each of the code vectors in the 
eodebook. Once all training veetors have been labeled with the eorresponding code 
vectors, the total distortion is computed by calculating the squared Euelidean distance. 
Summation of all these distances represents the total quantization error of the eodebook. 

The mean of each cluster’s members is eomputed to find the centroid loeation and 
eaeh eentroid loeation replaees the corresponding old eode vector. The absolute 
differenee between the updated eode vectors and the previous code vectors determines 
the number of repetitions required before terminating the algorithm. 

At this point, an important issue is to deeide upon the number of required code 
vectors (centroids) in the eodebook for a quality representation of speeeh features. One 
measure of the quality of a eodebook is the average distance of a feature vector in the 
training data from its corresponding code veetor. This figure is often ealled the eodebook 
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distortion. Obviously, the smaller the distortion, the more accurately the centroids 
represent the vectors they replace. A plot of the average distortion versus codebook size 
for this research is shown in Figure 3.12. 
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Figure 3.12. Average Distortion versus Codebook Size. 

Note that the performance of any codebook would deteriorate if it is used with 
feature types for which it has not been designed [Makhoul, 1985]. 

Generally, a codebook with size larger than the number of phonemes (basic 
speech units) in the English alphabet is preferred to account for the temporal and spectral 
variability in speech [Rabiner, 1993]. Vector quantization provides a means to reduce 
storage for speech information, discrete representation and reduced computation for 
speech modeling. However, Note that vector quantization also causes an inherent 
distortion in representing the original feature vectors and requires training time to 
generate a proper codebook. 
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Reasonably, it should be intended to keep the codebook as small as possible 
without decreasing performance since a large codebook implies increased computation 
and training time. Note that the centroids loosely correspond to different acoustic 
phonemes in the speech. Thus, the proper codebook size can be determined by examining 
the acoustic complexity of the vocabulary. If the vocabulary is small, fewer symbols 
might be sufficient [Deller, 2000]. 

Consequently, considering all the VQ issues discussed above, a codebook of 128 
code words was generated by using the .fiT-means (LBG) algorithm on the training 
sequence. The MATLAB implementations used for .^T-means algorithm and codebook 
generation were those available at [Sorensen, 2005] and are included in Appendices B.3. 
and B.4., respectively. 

Next, the continuous-valued feature vectors on the testing sequence were replaced 
with the codebook indices so as to obtain discrete symbols for classification, as shown in 
Figure 3.13. 


Training set of 
feature vectors 


Squared-Euclidean 
Distance d(-, •) 



Codebook indices 
\<k<K 


Figure 3.13. Block Diagram for Vector Quantization of Speech Feature Vectors (After 

[Rabiner, 1993]). 


Vector quantization converts the speech feature vectors (one feature vector per 
frame) to the discrete observation sequence O = {oj, 02 ,...,o,,...,o^} where each of the 

variables o, may only take integer values k from the trained codebook index for 
\<k<K. Now that a discrete set of symbols associated with each speech signal exists, it 
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is possible to build a discrete-symbol Hidden Markov Model (HMM) classifier. The next 
chapter discusses how to build such a HMM classifier for isolated word recognition. 
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IV. HIDDEN MARKOV MODELS (HMM) CLASSIEIER EOR 
ISOLATED WORD RECOGNITION (IWR) 

A. INTRODUCTION TO HMM 

Natural speech is a time-varying and non-stationary signal. Speech signal 
modeling helps characterize a set of distinct sounds inherent in speech, such as 
phonemes, syllables or other subword units thanks to their distinguishing acoustic 
representations. The fundamental problem in speech signal modeling is that the temporal 
structure of a specific speech sound is not thoroughly well-defined or easily-predictable 
due to a wide range of variability in its nature. When all combined, phonetic variability 
(due to the acoustic differences of the same phoneme in different words), background 
noise (owing to the microphone or channel distortion), intra-speaker variability (because 
of the changing physical or emotional condition of the same speaker) and most 
importantly inter-speaker variability (as a result of different utterances of the same word 
by different speakers) have tremendous adverse impact on the success of any speech 
signal model in use. Figure 4.1 illustrates the speaker-variable nature of speech on the 
different utterances of the same word “Right” spoken by four speakers. Although it is 
clearly seen that all the waveforms in Figure 4.1 have a common temporal structure, it is 
possible to observe the significant variations over time in detail from one repetition of the 
word to the other. 

In a broad sense, signal models can be categorized into two classes; namely 
deterministic models and statistical models [Rabiner, 1989]. In the first class, the 
Dynamic Time Warping (DTW) approach provides a straight-forward, non-parametric, 
template-based model which aims to match the incoming speech to the distinct templates 
generated by normalizing time variations of speech sounds. The templates are generated 
via training and generally represent coarse speech units like words. Although the DTW 
approach is relatively simple and straightforward, it has a number of known limitations in 
speech recognition due to the variability in speech patterns [Gold, 2000; Deller, 2000]. 
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Figure 4.1. Inter-Speaker Temporal Variability in the Word “Right” Spoken by Four 

Different Speakers. 


The second class of statistical speech signal models better accounts for non-linear 
variability in speech waveform, thereby outperforming the models of the first class. Most 
statistical approaches inherently model speech as being generated based on some 
probability distributions. The use of Artificial Neural Networks (ANN) and the Flidden 
Markov Models technique have mainly constituted relatively elaborate and yet 
extensively-used statistical models in speech recognition field thus far [Deller, 2000]. 

The basic theory behind the Hidden Markov Models (HMM) dates back to the 

late 1900s when Russian statistician Andrei Markov first presented Markov chains. 

Baum and his colleagues introduced the Hidden Markov Model as an extension to the 

first-order stochastic Markov process and developed an efficient method for optimizing 

the HMM parameter estimation in the late 1960s and early 1970s [Deller, 2000]. Baker at 

Carnegie Mellon University and Jelinek at IBM provided the first HMM implementations 

to speech processing applications in the 1970s [Rabiner, 1993]. Proper credit should also 
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be given to Jack Ferguson at the Institute for Defense Analysis for explaining the 
theoretical aspects of three central problems associated with HMMs, which will be 
further discussed in the following sections [Rabiner, 1989]. 

The technique of HMM has been broadly accepted in today’s modem state-of-the- 
art ASR systems mainly for two reasons: its capability to model the non-linear 
dependencies of each speech unit on the adjacent units, and a powerful set of analytical 
approaches provided for estimating model parameters. 

This research strictly deals with a discrete observation HMM implementation of 
an Isolated Word Recognition (IWR) system to justify the effectiveness of the airflow 
signals collected within the external air canal for speech recognition. Each word of 
interest consists of a sequence of phonemes and each phoneme is related to the values of 
the features extracted from an acoustic speech signal. Ordinarily, the method of HMM 
makes it possible to capture the non-linear variability in pronunciations (or utterances) 
using probabilities. The underlying assumption of the HMM approach is that the speech 
signal can be characterized as a random stochastic process if it is considered to be a 
sequence of quasi-stationary sounds. In other words, the HMM can be employed as the 
piecewise stationary model of a nonstationary speech signal [Deng, 2003]. Further, it is 
then possible to estimate the parameters of this stochastic process in a well-defined linear 
model stmcture [Rabiner, 1989]. 

This chapter first starts with the definition of the isolated word recognition (IWR) 
and presents the preliminary notations that will be used in the discussions of the Markov 
Process and Hidden Markov Model. Next, the Discrete-time Markov process is 
introduced and the ideas to the Hidden Markov Model (HMM) extended. 

Understanding the HMM framework is essential to the successful recognition of 
the airflow signals collected within the external air canal in this research. The three 
central problems associated with decoding the hidden state sequence, recognition and 
training of the HMM are discussed. Efficient solution algorithms to these problems are 
detailed. Also explained are some important implementation issues, such as scaling and 
training with multiple observation sequences. The chapter concludes with the HMM 
application to the isolated word recognizer (IWR) used in this research. 
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1. Definition of the IWR Problem 

In an IWR system, the goal is to preserve the temporal and spectral information 
needed to determine the phonetic identity of speech units and ignore factors like the 
background noise distortion or undesirable speaker-dependent impacts. In practice, a 
model is constructed for each word from a small vocabulary. Thus, each word has a 
probabilistic model of some observable symbol sequence. The task of the Isolated Word 
Recognition (IWR) system can be defined as finding the best matching word out of an L - 
word vocabulary whose model is most likely to generate the given observation sequence 
of feature vectors extracted from a speech signal. Figure 4.2 illustrates the simplified 
mechanism of a discrete observation HMM classifier for IWR task works. 


Acoustic 

Speech 

Signal 


Phonemes 

Vector 

Quantized 

Features 


HMM 


Figure 4.2 An Illustration of a Discrete Observation HMM Classifier for IWR. 

Assume that an observation sequence (or a set of symbols) corresponds to a set of 
feature vectors extracted from an acoustic signal. Let this observation sequence be 
represented by O = {o,, 02 ,...,o^}. Further, assume that the models of the words to be 

recognized correspond to W = {wj,W 2 ,..., . Then, the following equation estimates the 
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most likely word out of the L-word voeabulary, given the observation sequenee O 
[Jelinek, 1997]. 


w = arg max 

WeL 




(4.1) 


. , . p(y|x)p(x) 

By applying the Bayes Rule, P(.r y) = ——-/-r - to Eq. (4.1), the reeognition 

P(y) 


task transforms to the following equation: 

;[p(1E|0) 


w = arg max | 

WeL 


= arg max 

WeL 


P(0|Vk)P(lE) 

f(0) 


(4.2) 


In Eq. (4.2), P(IE) is the prior probability of having that partieular model, whieh 
produees the word to be reeognized. Eortunately, P{0), the probability of the 

observation sequenee, ean be ignored sinee this probability is eonstant when attempting 
to maximize over the possible word models [Jelinek, 1997]. Henee, the resulting equation 
beeomes: 

u’ = argmaxrp(o|iy)P(lE)1. (4.3) 

P(o|iy), the probability of having a particular observation sequence O given the 

word model W , will be obtained by using Hidden Markov Modeling. 

2. Notations 

Different sets of notations for Hidden Markov Modeling exist in the literature. 
Table 4.1 lists the preliminary notation that represent the basic HMM concepts. More 
advanced concepts follow on these notations. These notations are selected from [Deller, 
2000; Rabiner, 1989; Dugad, 1996] for convenience and consistency throughout the 
chapter. 
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Notation 

Definition 


Q is the set of possible states where q, is the 
state at timet and S is the number of states. 

0 { ^1 5 ^2 9 • • • 9 9 • • • 9 1 

0 is the set of possible observation symbols 
where K is the number of symbols in the 
codebook and o, is the symbol observed at time t. 

A = 

"a(ll) a(l2) ... a(l5)^ 

: : 

a(5 l) a(5 2) ••• a(5 5)^ 


A is the S-by-S State transition matrix where 
= = j) is the probability of 

being in state i at time t given the previous state 
was 7 at timet-1. Columns of A must sum to 
unity. 

/ 

B = 

'fc(l|l) a(\\2) ... fc(l|s)' 

b(A: 

a(K\\) a(K\2) ••• a(i^ 5) 

\ 

/ 

B is the K-by-S Observation probability matrix 
where b{^k\i^ = P(o, =k\q,= is the probability 

of having the observation k at time t given the 
model is in state i at time t . Columns of B must 
add up to unity as well. 

n = 

P{q,=i) = ^,, 

P{q,=S) = 7:^_\ 


n is the S-by-1 Initial state probability vector 
which shows the probabilities of having any state 
at time t = 1. Elements of the column vector n 
must also sum to unity. 

X = [A,B,7r] 

X is the compact notation for a HMM. 

a,[i) = P[o^,o^,...,o,,q^=i\X). 

The forward variable. 


The backward variable. 


Table 4.1. List of Basic Notations for HMMs. 

3. Discrete-Time Markov Process 

The Markov Chain is the underlying stochastic process of Hidden Markov 
Models. A finite-state Markov Chain is a stochastic discrete-time Markov process whose 
output is a set of S distinct states where each state corresponds to an observation. 
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[Rabiner, 1989]. State transitions (transition back to the original state is also possible) in 
a discrete-time Markov process can occur based on only two parameters: 

• State transition probability, a{i\j^ = P[qf=i\q^_^ = where 

s 

and = ^ forVj. 

(=1 

5 

• Initial state probability, P[q^=i) where 1 < / < S and 

i=l 

There are two distinct characteristics of a Markov process. First, there is one-to- 
one correspondence between the observation sequence and the Markov Chain state 
sequence. In other words, observations are deterministic. Second, the Markov Chain state 
sequence is also observable. 

The following example provides a better visualization of a discrete-time Markov 
process. 

a. Example 4.1: Three-State Observable Markov Model 
Assume that a Christmas tree is ornamented with a decorative light which 
takes on three colors in a sequence based on some priori probabilities. Red (R) color 
corresponds to state 1, green (G) color to state 2 and blue (B) color to state 3. Model 
parameters are given as follows: 

'a(l|l) = 0.5 a(l|2) = 0.6 a(l|3) = 0.3" 

A= a(2|l) = 0.3 a(2|2) = 0.4 a(2|3) = 0.6 and 

a(3|l) = 0.2 a(3|2) = 0 a(3|3) = 0.1 

P(^,=l) = 0.4' 
n= P[q^=2) = 03 . 

=3) = 0.3 

This three-state Markov model is illustrated in Figure 4.3. 
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Given a sequence of observed colors when the decorative lights are turned 
on, O = \K-G-G-R-B-B^ , the only corresponding state sequence is 

Q = [s^,S 2 ,S 2 ,s^,s^,s^]. The probability of having this observation sequence can be 
computed as 

P(0\A) = P(R-G-G-R-B-B\A) = P(q\A) = P(s^,S2,S2,s„S2,S2\A) 

= P{q^=l)P(s2\s,)P(s2\s2)P(s,\s2)P(s2\s,)P(s2\s2) 

= 0.4x0.3x0.4x0.6x0.2x0.1 = 5.76x10^ 

Finally, also note that an nth-order discrete-time Markov process is a sequence of 
random variables which depends only on the preceding n variables. A first-order Markov 
model is typically used in speech recognition [Gold, 2000]. 

Next, the extension of discrete-time Markov process to Hidden Markov Models 
(HMMs) is addressed. 

4. Extension to Discrete-Symbol HMMs 

In a stochastic Markov Model, each state is associated with a deterministically 
observable event and hence the name observable Markov Model is also given. The 
Hidden Markov Model can be considered an extended version of an observable Markov 
Model where the observation is a probabilistic function of the state [Rabiner, 1993]. 
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Given an observation sequence, there is no direct access to the underlying state sequence 
which generates the observations in a HMM; hence the state sequence is hidden. 

A Hidden Markov Model can be characterized by the following five model 
parameters: 

• S , number of hidden states in the model. 

• K, number of distinct observation symbols. 

• The state transition probability distribution, A = 7 , where 

a(i\j) = P(qt = i\qt-i = j) and 1 < i,j < S. 

• The observation symbol probability distribution, B = ^b{k\i^, where 
b[k\i^ = P{^o^=k\q,=i^, \ <i<S, \ <k<K. 

• The initial state distribution, ;r = |P(^j=/)|, where 1 < / < 5. 

The compact notation, X = {^A,B,X) is usually used to define the model 
parameters of an HMM for convenience. 

To better explain the extension of the observable Markov Model to HMM, the 
following two fundamental assumptions behind HMMs are discussed as well [Deng, 
2003]. 

• First-order Markov process assumption which states that the state 
transition at time t depends only on the previous state at time t-\, as 
expressed in the following equation. Note that state transition probability 
is not time-variant. 

P{Q\^) = P(q„...,q„-,qT |^)= P(qi (4.4) 

t=2 

= n^a[q^\q,)a(q^\q^)...a(q^\q^_,). (4.5) 

• Independent observation assumption which states that an observation is 
only dependent on the particular state that generates it and is independent 
from its preceding observations, as shown in Eq. (4.5). 

T 

P[o\Q,X) = P[o^,o^,...,o„...,Oj\q^,q^,...,q,,...,q.j.,X) = Y\P{o,\q„X). (4.6) 

t=l 

= b[o,\q,)b[o^\q^)...b[oj\qj). (4.7) 
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The following two examples illustrate two different struetures of HMMs. 

a. Example 4.2: Three-State Ergodic Hidden Markov Model 
The set-up of our previous example is extended with the Christmas tree 
light to ineorporate the hidden states and the non-deterministie observations in this 
example. Suppose a Christmas tree light whieh turns on any one of the three eolors; red 
(R), green (G) and blue (B) eaeh time a button is hit on a remote eontrol deviee (RCD). 
Assume that a kid ehanges the eolors of the Christmas tree light by using any one of three 
RCDs. Eaeh RCD is biased for the generation of eolors. There is no direet aeeess to the 
information about whieh partieular RCD is used whenever a ehange of eolor oeeurs as an 
observation. Thus, the following three hidden states oeeur. 

• State 1: RCDl is used. 

• State 2: RCD2 is used. 

• State 3: RCDS is used. 

The matrix A of state transition probabilities is given as follows: 

=l|^,_i =l) = 0.2 P(l|2) = 0.3 R(1|3) = 0.1 
A = {a(/|j)}= =2|^,_i =l) = 0.4 P(2|2) = 0.4 P(2|3) = 0.7 

=3|^,_i =l) = 0.4 r(3|2) = 0.3 R(3|3) = 0.2 

The initial state probability veetor, tt is 

>(<?,=!) = 0.5“ 
n = P{cix = 2 ) = 0.2 . 

=3) = 0.3 

The matrix B of observation probabilities is expressed as follows: 

>(o, =l) = 0.6 P(r|2) = 0.3 P(r|3) = 0.4 

B = i^b(k\i)]= p(o, =G|<?, =l) = 0.1 p(g|2) = 0.5 p(g|3) = 0.3 

=l) = 0.3 P(fi|2) = 0.2 R(fi|3) = 0.3 
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This HMM is called an ergodic model in that any state ean be reaehed 
from any other state. In other words, an ergodie HMM allows for all possible state 
transitions at any time, as shown in Figure 4.4. 


0.4 



Given that the eolors 0 = [B-G-G-R] appear in a sequenee of 

observations, the following question ean be asked: What is the probability of having this 
observation sequence given the model parameters ? 

In this ease, it should be obvious that there are 3x3x3x3 = 81 possible 
eorresponding state sequenees, whieh might have generated the given observation 

sequenee, G. Thus, the probability, P[o\X^ is 

81 81 

P(0\A] = YP(0,Q,\X) = y^P(0\Q„A]P(Q\A]. 

t=l /=! 

For example, if it is known that the generating state sequenee is 
presumably 2 = -53 - 52 } > then it should be possible to eompute the probability, 

P{o\X^ as follows: 

P[0\X) = p[0\Q,X)p[q\X) = 0.0135x0.056 = 7.56x10^. 

b. Example 4.3: Five-State Left-to-right Hidden Markov Model 

Consider a bag-drawing system eonsisting of bags and coins as another 
example of HMM. Assume there are five bags and three different types of eoins; niekels. 
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dimes and quarters in eaeh bag. Eaeh bag has ten eoins in differing numbers from each 
type of coin. Six drawings from five bags are conducted when blindfolded. Therefore, it 
is not known which drawing comes from which bag. However, there is a restriction to the 
sequence of the bags to be used. Once a coin is drawn from any bag, the only choice is to 
choose from either the same bag or go to the one of the next two larger-numbered bags 
for the next drawing. For instance, Q = [bag^,bag 2 ,bag 2 ,bag^,bag^,bag^] is a valid 

sequence of bags for the drawing. Thus, the drawing bags correspond to the hidden states 
whereas the six coins drawn represent the observation sequence in the model. 

The model parameters are given as 


"0.2 

0 

0 

0 

0 " 






■0.4" 

0.5 

0.3 

0 

0 

0 


“0.7 

0.2 

0.4 

0.2 

0.5" 


0.2 

0.3 

0.4 

0.4 

0 

0 

, B = 

0.1 

0.5 

0.3 

0.6 

0.4 

and n = 

0.1 

0 

0.3 

0.5 

0.3 

0 


0.2 

0.3 

0.3 

0.2 

O.IJ 


0.2 

0 

0 

0.1 

0.7 

1 






0.1 


Note that the upper triangular portion of the state transition matrix A represents the 
backward state transitions, which is all zero. This type of model structure is a special case 
of the model known as a left-to-right model. In a left-to-right HMM, backward state 
transitions are not allowed. Speech has inherently a temporal structure and thereby can be 
modeled to fit to the left-to-right HMM in a well-behaved fashion. Figure 4.5 illustrates 
the structure of this specific type of HMM. 



Figure 4.5. Five-State Feft-to-Right HMM. 

Next, the three fundamental problems associated with the use of HMM 
and their respective solutions are discussed. 


68 



B. THE THREE FUNDAMENTAL PROBLEMS FOR HMM 

For a practical implementation of HMM, three basie problems of interest must be 
diseussed whose effeetive solutions will lead to the training of the model, that is the 
estimation of the model parameters, /I = |A,5,;r} , and finally the resulting elassifier. 
Rabiner et al. lists these eentral problems as follows [Rabiner, 1989]; 

• Problem 1: Given an observation sequenee O = and a 

HMM/l = {A,B,7rj , how is P{o\X^, the probability that the observation 

sequenee is generated by the given model, eomputed? Problem 1 is known 
as the evaluation or scoring problem. 

• Problem 2 : Given an observation sequenee G = {oj, 02 ,...,o,,...,Oj,} and a 

HMM/l = , how is the generating state sequenee 

Q* = , whieh aeeounts for the observations optimally, 

determined? A solution that would satisfy Q* =argmax P{^Q,0\X\ in 

this problem is sought. Problem 2 is also known as the decoding problem 
in that the goal is to uneover the hidden part of the model, the optimal 
state sequenee. 

• Problem 3: How is the model parameter set X = [A,B,n], whieh 
maximizes P(G|/l), the probability of the observation sequenee given the 
model, estimated? Problem 3 represents the training problem. 

1. Solution to the Evaluation Problem 

Computing the probability of oeeurrenee of a partieular observation sequenee 
given the model ean also be eonsidered a seoring problem in whieh the attempt is to 
determine the suitability of a given model to generate the partieular observation 
sequenee. This standpoint beeomes useful when aiming to ehoose the best model for a 
designated word among many eompeting models where eaeh model represents a word 
from a small voeabulary [Rabiner, 1989]. 

Basieally, there are two options available to evaluate the probability p(g|/ 1); the 

direet brute-foree evaluation and the forward-baekward proeedure. The former approaeh, 
ealled the direet evaluation, is straightforward albeit impraetieal for many real-world 
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applications. The latter method, called the forward-baekward proeedure, is extensively 
used to address the evaluation problem in HMM. 

a. The Direct Evaluation 

The brute-force approaeh to the eomputation of P{o\X^ starts with 
finding the eonditional probability P[o\Q,A,^ along a eandidate path (any allowable state 
sequence) Q = [q^,q 2 ,...,q,,...,qj] then multiplies it by the probability of sueh a path, 
p{q\X^. Summation of these produets over all possible state sequences aeeounts for the 
probability Pi^O\X^. Applying the independent output assumption, the joint output 
probability of observations along the path Q is eomputed 

P(o\Q,X) = Y\p(o,\q„X) = b[o,\q,)b[o^\q^)..b[o.,\q.,). (4.8) 

t=\ 

The probability of the path Q is eomputed using a first-order Markov 
assumption as follows: 

T 

p(q\^)= p(qi k-p^)=ki)«(<?3 kr-i)- (4-9) 

t=2 

Therefore, the probability of observation sequenee given the model is 
obtained by the following eomputation: 

P[0\X) = Xp{0\QA)p{Q\^) (4.10) 

allQ 

= Y,^c,p{o,\q,)a[q^\q,)j[o^\q^)..M[qj\qj_,)b[oj\qj). (4.11) 

allQ 

A close observation of Eq. (4.11) reveals the eomplexity of the 
computational requirements in the brute-force direct evaluation approaeh. Considering 
that there are distinct possible state sequenees and 27-1 multiplieations are 
performed in the summand of Eq. (4.11), there is a total of (27-1)5^ »27 • 5^ 
multiplieations and 5^-1 additions [Dugad, 1996]. Therefore, the brute-force 
eomputation is very expensive and it beeomes essential to eompute P[o\X^ more 
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efficiently. Next, we consider the forward-backward procedure which is a very efficient 
solution to the evaluation problem. 

b. The Forward-Backward Procedure 

First, the forward variable (i) is introduced and defined as: 

a,(i) = P(o^,02,...,o^,g^ =ilA). (4.12) 

The forward variable «, (/) is described as the joint probability of the partial observation 
sequence up to instant t, {oj, 02 ,...o,} and the state i at instant t, given the model A, . The 
Forward Recursion can be used to solve for a,{i) inductively as follows: 

1. Initialization; for 1 < / < 5, 

ai(/) = P(oi,^i =/|;L) = h(oil^i). (4.13) 

2. Recursion: for ? = l,2,...,r-l and 1< j<S, 

(4-14) 

^'=1 

3. Termination: for \ <i<S, 

s 

P[0\X) = Y,cCt{'^- (4-15) 

(=1 

In the first step, the forward probabilities are initialized as the joint 
probability of having the observation Oj at the initial state i for all possible states. In the 

second step, a,+i( 7 ), the joint probability of the partial observation sequence up to 

instant t + \ and the state j at instant ? + l is computed from «,(/), probabilities of the 

previous partial observations, recursively. The Lattice (Trellis) structure shown in Figure 
4.6 illustrates how state j at time t + l can be reached independently from any of the 
S states at time t. The third step of the forward recursion shows that the desired 
calculation of P[o\X^ is obtained from the summation of S independent final forward 

variables, Ujif^'s which realize the observation sequence [Rabiner, 1993]. 
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a,{l) 



TIME -► t-\ t t + l 


OBSERVATIONS —► Oj_^ <^r+i 

Figure 4.6. The Lattiee Strueture Used to Derive the Forward Reeursion (After 

[Deller, 2000]). 


The proof of the Forward Reeursion is provided from [Wang, 2005] by 
using the HMM assumptions [Eqs. (4.4) through (4.7)], the prineipal of eonditional 
probability given as: 

p[a,B\X) = p[a\B,X)p[b\X) = p[a\X)p[b\A,X), (4.16) 

and the prineipal of total probability expressed as: 

R(A)=X^(A5). (4.17) 

All B 
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«r+i(7) = P(o„o^,...,o,,o,^„q,^, = j\X), 

= P(o^,o^,...,o„o,^^ \q,^^ = j,X)P[q^^, = j\X), [From Eq. (4.16)] 

= P(oi, 02 ,...,o, \q,^, = j,X)P[o,^, \q,^, = j,X)P[q^^, = j\X), [From Eqs.(4.6)] 
= P{o„o^,...,o^,q,^^ = j\X)P[o,^,\q,^, = j,X), [FromEq. (4.16)] 

= P(o„o^,...,o^,q^^, = j\^)b(o^^, I 7 ) 


j 

Y,P[o„o^,...,o„q,=i,q,^, = j\X) b[o,^,\j), [FromEq.(4.17)] 

(=1 J 

5 1 

Y,P(o„o^,...,o„q,=i\X)p[q,^, = j\o^,o^,...,o„q,=i,X) ^( 0,^1 |j), [FromEq.(4.16)] 

/=1 J 

5 1 

Y,P[o„o^,...,o„q,=i\X)P[q,^, = j\q,=i,X) ^( 0,^1 1 ;), [FromEq.(4.4)] 

/=1 J 

Y,cc,{i)a[i\i) b[o,^,\j) [FromEq.(4.12)]. 


It should be noted that there are N multiplieations involved in the first 
step and 5(5+l)(5-l) multiplieations involved in the seeond step. There are no 
multiplieations in the third step. Therefore, the Forward Reeursion results in a total of 
5 + 5 ( 5 + l)(r-l) « multiplieations in eomparison with 2T-N^ multiplieations 

required by the direet evaluation approaeh [Dugad, 1996]. Consequently, the Forward 
Reeursion provides a praetieally feasible method for the eomputation of the desired 

probability . 

Next, the backward variable /3^{i) is defined in a similar way as was done 
with the forward variable as follows: 

P,id) = P(,OM^o,^2^-^Oj\q,^i,X). (4.18) 

/3,{i) is defined as the eonditional probability of the observation sequenee from the 
instant t +1 to the final observation instant T given the state i at the instant t and the 
model X. The probability P[o\X^ ean be evaluated by eomputing P,{i) with iterations 
whieh are baekward in time, henee ealled the Backward Recursion. 
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Although applying the Forward Recursion is enough for the computation 
of P{o\X^, a combination of forward and backward procedures becomes useful when 

dealing with the training problem. Thus, the Backward Procedure is presented, as given 
in [Dugad, 1996]. 

1. Initialization; for 1 < / < 5, 

f3,{i) = \, (4.19) 

2. Recursion: for t = r-l,r-2,...,2,1 and \<j<S, 

A (0 = Z ^ 1*)^ li) ( i) ’ (4.20) 

3. Termination: for 1 < / < 5, 

= (4.21) 

(=1 

The computation of the probability Pi^O\X^ by using the Backward 

Recursion results in multiplications on the order of S^T just like the Forward Procedure; 
thereby providing an equally efficient method. 

2. Solution to the Decoding Problem 

The solution to Problem 1 computes P{o\X^ by summing up all possible state 

sequences through a Trellis structure in a HMM, hence it does not result in the optimal 
state sequence that accounts for the observation sequence. In Problem 2, the single most- 
likely state sequence (path) Q* is sought for a given observation 

sequence and model. In other words, the goal is to find the optimal path that satisfies 
2 * = argmax P(2,0|/l] , considering that maximizing the probability P{^Q\0,X\ is 

equivalent to maximizing P[Q,0\X^. The Viterbi Algorithm can be used to determine 

such an optimal state sequence as a dynamic programming method applied to HMM 
[Rabiner, 1993]. 
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a. The Viterbi Algorithm 

The Viterbi Algorithm ehooses and keeps the most-likely state sequenee 
for eaeh of the S possible states at eaeh instant reeursively unlike the Forward Reeursion 
whieh sums over the transitions from all possible ineoming states reaehing the same 
destination state. Onee the Viterbi Algorithm reaehes the final step at the end of 
observation sequenee, it uses baek-traeing to find the most probably path. 

Following reformulation of the deeoding problem provides an insight into 
the Viterbi Algorithm as it is applied to the estimation of the optimal state sequenee in a 

HMM [Dugad, 1996]. Ultimately, the goal is to maximize the probability P{Q,0\X^. 

The prineipal of eonditional probability is employed first using Eqs. (4.10) and (4.11) to 
obtain 

P(0,Q\X) = P(0\Q,X)p(q\X) (4.22) 

= \q,)a[q^ \qi)b{o^ \q^)...a[q^ \Qt)- (4-23) 


Then, a distanee funetion assoeiated with any possible state sequenee 
along the path from t = 1 up to t = T is defined by taking the negative logarithm of Eq. 
(4.23) as follows: 


Dt D{^q^,q2,---,qj^ 


i 

\og[n^b[o, |<?i)) + ^ log (a \q,_,) b {o, \q ,)) 

t=2 


. (4.24) 


Now, observe that the maximization problem to obtain the most-likely 
state sequenee given in the following equation 


2*=argmax P{Q,0\X\ fort = l,2,...,r 

All q, ^ 


(4.25) 


turns out to be a minimization problem over the distanee funetion as expressed in 


2’ =argmin[D(<7j,<72,...,<7j.)]. 

All 


(4.26) 
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The term -\og{7t^b{^o^\q^m Eq. (4.24) is called the initial weight of 

having observation Oj at the initial state . The weight of a given state sequence is equal 
to the summation of all the weights along the path. In a similar way, the term 
-log^a( 7 |/)h(o,in Eq. (4.24) corresponds to the cost associated with going from 

state i to state j at time t. Note that the weight function is associated with a state 
sequence, whereas the cost function is associated with a state transition only. 

There are two expressions used in the Viterbi Algorithm. The term (/) 
stands for the accumulated weight along the path as far as proceeding to state i at time t. 
The term denotes the state at time t-l whose transition to state j at time t has 

the minimum corresponding cost. It is necessary to keep track of the latter argument so as 
to backtrack and decode the optimal state sequence in the end. It is now possible to 
present the Viterbi Algorithm as follows; 

1. Initialization; fori < / < 5 


<5',(i) = -log( 2 ',)-log(*(o,|i)), 

(4.27) 

T,(;) = 0, 

(4.28) 

Recursion: for2<t<T and 1 < j<N 


( j) = min [S,_, (i) - log [a (j |/))] - log (h(o, | j)), 

(4.29) 

'^,{j) = argmm S,_^{i)-\og(a(j\i)) , 

l</<5 ^ / J 

(4.30) 

Termination: 



(4.31) 

q*j = arg min (/)], 

l<i<S 

(4.32) 


4. Backtracking: for t =r-l,r-2,...,2,l 
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(4.33) 


The argument P*, the negative log probability of the path with the 
minimum weight, is called the log likelihood. Although the expression exp(^-P*^ equals 

the desired probability strictly, the argument of log likelihood is preferably used instead 
of the regular probability for convenience in computations without any conversion. Note 
that small log likelihood measures correspond to large probabilities. In addition, taking 
the negative logarithm is a means of circumventing the precision and underflow problems 
which occur when small probability values are multiplied iteratively. 

A little thought over the steps involved in the Viterbi Algorithm reveals 
that the computational requirement for finding the most-likely path is, similar to the 
Forward Procedure, on the order of S^T additions. 

Figure 4.7 illustrates that the Viterbi Algorithm operates in a Trellis 
structure in a similar way as that followed in the Forward Recursion. Bold arrows 
represent the choice of the most-likely state at each time step and the resulting best state 
sequence is shown as Q* = 53 , 52 , 52 } in the three-state HMM in Figure 4.7. 


TIME 



OBSERVATIONS —► 6>j 6>2 O3 


Figure 4.7 Trellis Structure Illustrating the Viterbi Algorithm for a Three-State HMM 
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3. Solution to the Training Problem 

The training problem deals with adjusting (or re-estimating) the model parameters 
/l(A,5,;r) in an effort to learn and eneode the eharaeteristies of the observation 

sequence into the model such that the model should identify a similar observation 
sequence in future [Dugad, 1996]. 

Unfortunately, optimizing the training parameters of a HMM is difficult to 
address as there is no known analytical method that maximizes the joint probability of the 
training data in a closed form [Rabiner, 1993]. Two of the most popular methods to 
handle this problem are discussed. First, it is possible to optimize the parameters 
/l(A,fi,;r) such that the likelihood of occurrence for the training data P[o\X^ is locally 

maximized by using the iterative Baum-Welch algorithm, also known as the Forward- 
Backward algorithm [Baum, 1966; Baum 1970]. Second, the problem can be solved by 
first applying the Viterbi algorithm, and next maximizing the likelihood P{o\X^ over the 
single most-likely state sequence. 

a. The Baum-Welch Re-Estimation Algorithm (B-WAlgorithm) 

This method starts with an initial arbitrary model X and searches for a 

new model X which improves the probability that the given observation sequence is 
generated by the new model, at each iteration until a maximum is reached, such that; 

P[0\X^>P[0\X). (4.34) 

This optimization criterion is known as the maximum likelihood criterion. Some new 
notations from [Rabiner, 1993; Dugad, 1996] are first introduced to explain the B-W re¬ 
estimation algorithm. The term Y,{i) denotes the conditional probability of being in state 

i at time t, given the full observation sequence O = { 0 ^, 02 ,...,o,,...,Oj] and the model 
X[A,B,7i), as defined in the following form: 

r,{i) = P(q,=i\0,X). (4.35) 
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By using the definitions of the Forward variable a, (/) from Eq. (4.12), 
the Backward variable (5^ (/) from Eq. (4.18) and the Bayes Rule, the following relation is 
obtained; 


P{q,=i,0\X) 

' P{o\X) P{o\X) 


(4.36) 


The conditional probability of being in state i at time t and making a 
transition to state 7 at time t + 1 , (/,j),is also defined, given the observation and the 

model, i.e.. 


^,{Uj) = P(qt=Uq,^x = jp^^)- (4-37) 

By using the forward and backward variables and the Bayes’ Rule, the 
probability (/, j) can be written in the form [Rabiner, 1993]; 




p(qt=Uq,^x = 

P(0\X) 

P(0\X) 

1=1 j=\ 


(4.38) 


The derivation involved in Eq. (4.38) can be seen intuitively in Eigure 4.8, 
which illustrates how the probability is computed. In this illustration, a, (/) and 

account for the partial observation sequences {oj, 02 ,...,oJ and 

, respectively, for completeness, j\i^ explains the transition from 

state i to state 7 and represents the probability of selecting the symbol 

from state 7 . All these probabilities are multiplied to compute the numerator term in Eq. 
(4.38) because their respective events are statistically independent. Summation of these 
products over all possible states 7 and i accounts for the denominator term in Eq. (4.38). 
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Figure 4.8. Illustration for the Computation of the Probability (/, j) (After [Rabiner, 

1993]). 

It should also be pointed out that y, (/) is related to ^ (/, 7 ) by summation 
over j as shown below: 

7=1 

Summation of y, (i) over the observation time t provides a useful quantity, the expeeted 
number of transitions from state i 

T-l 

Expeeted number of transitions from state/. (4.40) 

t=i 

By way of similar reasoning, the expeeted number of transitions from state i to state j is 
obtained by summing (/, 7 ) over the time index t 

T-\ 

[i, 7 ) = Expeeted number of transitions from state i to state 7 . (4.41) 

t=i 

Now that all the underlying eoneepts have been diseussed and their 
respeetive notations given, Eqs. (4.40) and (4.41) ean be used in the Baum-Weleh re¬ 
estimation formulas as follows [Rabiner, 1993]: 
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(4.42) 


Tt = Expected number of times having state i at time t = \. 



Expected number of transitions from state i to state j 
Expected number of transitions from state j 

m-j) 

/=! _ 

r-1 

Zn 

t=l 


(4.43) 




Expexted number of times in state j and observing symbol o, =k 
Expected number of times in state j 



(4.44) 


Based on the formulas given in Eqs. (4.42) through (4.43), an updated model i is 
obtained until the new model satisfies the condition: 


p[o'\i^-p[o\X)>s, 


(4.45) 


where the tolerance s denotes a very small number. 

However, the B-W Re-estimation algorithm improves towards the nearest 
maxima in the proximity of the initial model parameters owing to the complex nature of 
the joint likelihood function of the training set P{^0\X^ which has many local maxima. 

One way to circumvent this problem is to repeat the B-W re-estimation algorithm a few 
times with different random initial model parameters and select the best re-estimated 
model which returns the maximum P[o\X^. 

b. Viterbi Re-Estimation Algorithm 

The Viterbi decoding algorithm has been applied to ascertain the optimal 
hidden state sequence so far. This algorithm can also be applied to the training problem to 
re-estimate the model parameters over the most-likely state sequence, unlike the B-W re¬ 
estimation algorithm which works over all possible state sequences. 
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In a left-to-right HMM set-up, the initial state can be designated as state 
one to save from re-estimating the initial state probabilities. First, the optimal state 

sequence Q* is estimated and then the probability p[o\^Q *is evaluated. The 

following arguments are also tracked along the most probable path [Deller, 2000]; 

j I i^= Number of transitions from state i to state j along the path Q*, (4.46) 


h) = Number of transitions from state i along the path Q*, 

n{j\ •) = Number of transitions to state j along the path Q*, 

n(^o, =k,q, = j) = Number of times observation k and state j 
occur concerrently along the path Q*. 


The Viterbi re-estimation formulas are presented as follows: 




(4.47) 

(4.48) 

(4.49) 


(4.50) 


(4.51) 


Note that although the Viterbi re-estimation algorithm seems simpler and 
more computationally-efficient than the B-W re-estimation, it does not incorporate all 
possible state sequences into the model update. 

C. HMM IMPLEMENTATION ISSUES 

This section discusses two important implementation issues for training and 
recognition of HMMs. First, there is a numerical underflow problem inherent in the 
computation of the B-W re-estimation formulas. Second, it is necessary that HMMs be 
trained by multiple observation sequences so as to obtain a more rigorous representation 
of the statistical variations inherent in the observations (or utterances of words for IWR 
purpose.). 

I. Scaling the B-W Re-estimation Formulas 

It has been demonstrated that the B-W re-estimation formulas given in Eqs (4.42) 

through (4.44) can be used to update the model parameters. A potential problem with the 
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forward, backward variables and the B-W re-estimation formulas is that as the time index 
t inereases, the large number of multiplieations of probabilities between 0 and 1 may 
exeeed the preeision range of numerieal eomputations, henee eausing underflow 
problems. As a result, it beeomes neeessary to seale the forward and baekward variables 
to eireumvent this problem [Rabiner, 1993]. First, the re-estimation formula for 

must be rewritten in terms of the forward and baekward variables by using Eqs. (4.38) 
and (4.39). 

^(j\i) = Y^s -• (4-52) 

^ 5] «, (/) a (7 \i)b(o,^, I ( j) 

t=i j=i 

Next, the forward variable (/) is sealed by the sealing eoeffieient , that is: 

-, (4.53) 

!!«,{}) 

j=l 

= (4.54) 

J=l 

where the notation a, (/) denotes the sealed version of the forward variable. Similarly, 
the baekward variable is sealed and shown as 

AU) = cAU)- (4.55) 

Note that the eomputations are kept within a reasonable dynamie range and the 
following sealed re-estimation formulas obtained by replaeing the un-sealed forward and 
baekward variables with the sealed variables in B-W re-estimation formulas, resulting in 
[Rabiner, 1993]: 
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a(j\i) = Y^s -1-’ (4.56) 

Y,^a^{i)a(j\i)b(o^^, \ ( j) 

t=i j=i 


b(k\j) = ^, -. (4.57) 

Z«^(-/‘)A(7) 


Furthermore, the probability P[o\X^ ean be eomputed in terms of the sealing 
faetor c,. The following property is introdueed first [Rabiner, 1993]: 


nc,X«r(>') = l- (4-58) 

t=\ 1=1 


From Eqs. (4.15) and (4.58), 


Y\c,p(o\X)=\, 


and 

P(o\X)=^. 


By taking the negative logarithm on both sides of Eq. (4.60), the log likelihood funetion 
is eomputed given as [Deller, 2000]: 


log[p(o|2,)] = -j]logc,. (4.61) 


Einally, it should be noted that the Viterbi deeoding and re-estimation algorithms, 
as deseribed in the previous seetions, are not inherently suseeptible to an underflow 
problem due to the logarithmie operations involved. Therefore, it is not neeessary to seale 
the Viterbi algorithm. 


(4.59) 

(4.60) 
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The MATLAB implementation for the sealed B-W re-estimation algorithm is that 
available at [Sorensen, 2005] and given in Appendix C.l. The MATLAB implementation 
for the Viterbi training algorithm is provided in Appendix C.2. 

2. Training with Multiple Observation Sequences 

The training algorithms and the method of sealing the re-estimation formulas for 
HMM so far have been diseussed, provided that there is only a single observation 
sequenee of interest O = available. However, training a HMM using a 

single observation sequenee is not praetieal in most real-world applieations beeause it is 
possible to build a more rigorous model whieh eneodes many statistieal variations of the 
same event to be modeled by using multiple observations. Sueh a model is more likely to 
extraet a set of eharaeteristies relevant to the reeognition of a partieular word. Therefore, 
a modifieation to the re-estimation proeedure is essential to aeeount for multiple 
observations [Rabiner, 1993; Deller 2000]. 

Thus, the study uses a left-to-right model in whieh state transitions from state 1 at 
time t = \ to state S at time t = T oeeur sequentially, assuming that there are 
L observation sequenees denoted as, 

O = (4.62) 

where 0‘'‘^ = represents the observation sequenee. By assuming that 

all observation sequenees are statistieally independent, the joint probability of L 
observation sequenees results in 

= (4.63) 

i=\ 

Reeall that the re-estimation formulas are ealeulated by enumerating the 
oeeurrenees for eaeh event. Henee, the unsealed version of the re-estimation formulas for 
multiple observation sequenees should add up the expeeted number of oeeurrenees for 
eaeh individual observation sequenee as follows: 
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1 





u 

I 

/=1 




Z«'(Oa'(« 


(4.64) 


and 


^(fc|j) 


^ 1 
l^iP[0^'^\X) 


li «'(j)A'(i) 

/=! 

o\^^=k 


^ 1 


Z«'(./‘)A'(j) 


(4.65) 


Equation (4.60) is modified to obtain the relation between cf ^, the scaling factor 
for the /'* observation and P{o^'^\X^ for the multiple-observation scenario in Eq. (4.66) 


P[0^‘^\A)= fj 


f T, 


uo 


(4.66) 


V f=i J 


Rewriting Eqs. (4.64) and (4.65) in terms of the scaled variables a\{i) andy9/(/) the 

1 


scale factor and the term 


causes each other to be canceled out of the 


nominator and the denominator, due to the relation given in Eq. (4.66). Consequently, the 
modified re-estimation formulas in terms of scaled variables become; 


L T,-\ 


£Z«' \j)Pl,{i) 

U)=— 


L r,-i 


(4.67) 


ZZ«'( 0 a'(« 


l=\ t=\ 


L Tj-\ 


Z Z «'(7)A'(j) 


b{k\j)^- 


l=\ t=\ 


o\^^=k 


S 

i=\ 


T,-\ 


(4.68) 


p[o^'^\x) 


Z«'(7)A'(7) 


t=\ 


and 
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(4.69) 




L 


I 


a'[o^,i) fi'[o^,i) 


Note that it is not necessary to re-estimate n. in a left-to-right HMM structure by 
assuming ;rj =1 and =0 when/;^!. Finally it should be pointed out that the same 

rationale behind the modification of B-W re-estimation formulas to account for multiple 
observation sequences also applies to the Viterbi re-estimation approach. 

The MATLAB implementation for the HMM training with the multiple 
observations (speech feature vectors) is that available at [Sorensen, 2005] and presented 
in Appendix C.3. 

D. ISOLATED WORD RECOGNIZER IMPLEMENTATION USING HMMS 

Hidden Markov modeling was discussed, as well as the underlying assumptions of 
HMMs, the model parameters, their common structures and the basic implementation 
issues. Also presented were two popular training algorithms for estimating optimal model 
parameters in this chapter. From an acoustic signal modeling point of view, a HMM can 
be considered a combination of two stochastic processes. These are a hidden Markov 
Chain, which represents the temporal variability of an acoustic signal in the hidden state 
sequence, and an observable process which explains the spectral properties of an acoustic 
signal. Based on this background knowledge, an isolated word recognizer using discrete 
symbol HMM classification is now introduced, so as to identify seven distinct sets of 
acoustic signals collected from within the external ear canal. 

The discrete symbol HMM works with a finite set of integer symbols from a 
codebook. The K-means clustering algorithm makes it possible to generate such a 
codebook from the feature vectors of a training data set. The discrete observation 
sequences are derived from the indices of the codebook by using the feature vectors of a 
designated word from the vocabulary. The goal is to identify the designated unknown 
words that these observation sequences represent, by using Hidden Markov modeling. A 
left-to-right HMM structure attempts to model the sequential pattern Itkely to be present 
within the symbols of a given observation set. Needless to say, there are temporal 
variations in the nature of speech utterances. Thus, it should be pointed out that a left-to- 
right model is more appropriate for isolated word recognition [Rabiner, 1993]. An 
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additional common restriction on such a model strueture is that no more than two-state 
jumps at a time are allowed in forward transitions. This is also known as a Bakis model 
[Deller, 2000], 

Therefore, the model strueture ehosen for reeognition is a left-to-right HMM with 
diserete observation symbols. It is also noteworthy to point out that even an ergodie 
HMM trained with speeeh utteranees often fits into a sequential strueture eventually with 
baekward transition probabilities turning out to be zero [Deller, 2000]. 

Unfortunately, there is not a widely-aeeepted method to determine the proper size 
of the model, that is, the number of states neeessary to eneode the relevant eharaeteristies 
of the observations into the model so as to result in the best reeognition aeouraey. 
Needless to say, the relationship between the number of states and the performanee of 
HMM is erueial. 

One of the basie linguistie units eommonly used in Hidden Markov modeling is 
the phoneme. As a matter of faet, a phoneme is nothing more that an abstraet unit that 
might have various aeoustie forms from one eontext to another or sometimes from 
speaker to speaker due to the naturally oeeurring variations in the utteranee. 

Some researehers suggest the number of states be at least equal to the number of 
phonemes eontained in the designated word. Some others report it is better to assign two 
or sometimes three states per phoneme taking into aeeount the phoneme itself and the 
transitions to and from it [Deng, 2003; Beeehetti, 1999]. In most real-world HMM 
applieations, experimental results suggest the proper model size [Deller, 2000]. Seleeting 
too small a number of states would result in poor elassifieation due to the laeking 
eapability in modeling properly. Also, it should be noted that large models suggest 
inereased eomputational and memory eost. In the training phase, model sizes of five 
through eight states were experimented with to determine the proper model size. This 
study’s experimental results showed that eight states are suffieient to model the linguistic 
units of phonemes present in the voeabulary. The assumption is that the number of states 
should eorrespond roughly to the average number of distinet sounds (phonemes or 
syllabuses) in eaeh word. Thus, an eight-state HMM was seleeted. 
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There are mainly two approaches to implement an isolated word recognizer using 
HMMs. The first technique models each word in a small vocabulary with a separate 
HMM. The second is more appropriate for larger vocabularies whose goal is to model 
subword basic units such as phonemes, diphones or triphones. In this method, all these 
models should be combined to account for the whole word. The former technique was 
chosen by modeling each one of the seven words in the vocabulary by a separate HMM 
as shown in Figure 4.9. 
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Figure 4.9. Block Diagram of the Isolated Word HMM Recognizer ([After (Rabiner, 

1993)]. 


The procedure followed in the isolated word recognition may be divided into two 
steps [Rabiner, 1993]; 

1. First, a separate HMM is built for each word in the vocabulary, where, 
each word model has the same number of states. This step involves the 
training phase to estimate the model parameters =[A,B,7r) using 
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either the B-W re-estimation teehnique or the Viterbi re-estimation 
algorithm deseribed in Seetion (IV.B.3). 

Second, the trained HMMs are used to identify each unknown word in the 
testing data set. The recognition phase involves the computation of the 

likelihood for all possible models given the observation 

sequence belonging to the designated unknown word. Specifically, the 
Forward-Backward recursion described in Section (IV.B.l) is used to 

calculate for the trained seven models and then the index of the 

model with the maximum likelihood is identified as the designated word 
such that: 


w* = arg max 



l<w<7 


(4.70) 


In training the models, both the Baum-Welch re-estimation and the Viterbi re¬ 
estimation algorithms were used for comparison. The Baum-Welch re-estimation 
algorithm outperformed the Viterbi re-estimation algorithm by resulting in slightly better 
recognition rates after the training. Therefore, the B-W re-estimation scheme, which uses 
multiple observation sequences as described in Section (IV.C.2), were chosen to train the 
models. This scheme implements training using scaled forward and backward variables to 
prevent underflow problems in numerical computations. Initial model parameters 
(A,5,;r) are randomly chosen before training. Then, these models are trained until they 

have an optimized set of parameters, which produces the maximum likelihood 

for the given observation sequence of the designated wordw. It was observed that these 
models do not necessarily converge to the same parameters when the training for each 
model is repeated. 

In recognition, the MATLAB implementations available at [Sorensen, 2005] were 
used. The MATLAB routine in the Appendix C.4. computes the log likelihood function 
for a given observation sequence and trained model parameters. The MATLAB routine in 
Appendix C.5. implements the HMM-based recognition by selecting the model which has 
the maximum log likelihood as the identified word. 

The whole data set was consisted of 50 repetitions for each of the seven words 
{left, right,up, down, kill, pan, move] spokQn by 20 adult subjects. Two thirds of the data 
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set was randomly selected for training and the remaining one third of the data was used 
for testing. 

The next chapter starts with a brief overview of the entire isolated word 
recognition system, describing each step from the in-ear microphone data collection to 
the classification of the words under consideration. Finally, HMM-based average 
classification results are presented for the isolated word recognition system used in this 
study. 
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V. HMM CLASSIFICATION RESULTS 


This chapter presents a system overview of the speaker independent, small- 
voeabulary, isolated word HMM reeognizer before diseussing elassifieation results. 

A. SYSTEM OVERVIEW 

The main goal of the study was to reeognize a set of seven spoken words, in 
whieh the aeoustie signals were eolleeted within the external ear eanal using an isolated 
word recognizer. The small vocabulary consists of seven words: {left, right, up, down, 
move, pan, kill) and the speech database was generated using 20 adult subjeets who 
uttered eaeh word 50 times by wearing the in-ear mierophone discussed earlier. A 
diserete-symbol HMM reeognizer was ehosen beeause the voeabulary under 
eonsideration is small and eonsists of short words. The isolated word reeognition system 
assumes that the speeeh signal is a realization of some message eomposed of basie 
subword units, e.g., phones (or phonemes), which can be eonsidered as a sequence of 
symbols from a unique codebook. Figure 5.1 represents the overall block diagram of the 
isolated word HMM reeognizer used in the study. 


Observation sequence 
0 = {0^,02,. 


\ 


Analog Speech 


\ 


Crcqjped signal 


Recording 

Via 

Ear-Mic 


A/D 

Converter 


Front-end 
' Detector 


Feature 

Extraction 


Codebook 

generated 

i 


Vector 

Quantization 


Training 


7 


Speech Signal, s(n) 


Feature vectors 

F = {f„f„...,fd 


Testing 


/ 


Observation sequence 
O = io,,Oo,...,OT] 


/HMM\ 

for 

V Word ! / 


Probability 

Computation 


'^HMMN 

for 


P{0\M,) 


’ for 

Vword 1J 


Probability 

Computation 


Recognized 

Word 

argmax,. 


Wordy 


/ 


1 

_^ 






Maximurr 

Probability 




Computation 



Detector 

• 







Figure 5.1. System Block Diagram of the Isolated Word HMM Recognizer (After 

[Sorensen, 2005].). 
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The IWR system ineludes the following six phases: 

1. Recording step: Audio reeordings containing the utterances were first 
collected using the in-ear microphone in an earlier study by [Newton, 
2006] as described in Section I.C. Recall that a total of 50x20x7 = 7000 
isolated utterances of seven words were collected within the external ear 
canal. In addition, 1,000 isolated utterances of the word “Kill” were 
collected from the same 20 adult subjects within the external ear canal 
while a piece of loud music was played in the same room to simulate noisy 
environment conditions. This dataset was named “Kill noise” in the 
database. Finally, 400 isolated utterances of the word “Right” were 
collected from the same adult subjects with the in-ear microphone placed 
outside the mouth to investigate potential distortions due to the ear canal. 
This dataset was named “Right outside.” 

An A/D converter digitized the speech recordings with 8 kHz sampling 
rate and 16-bit quantization scheme. Next, the digitized speech was 
transferred to and stored in a notebook PC via the PCMCI card (DAQ 
Card-Al-16E-4 by National Instruments). 

2. Front-end detection step: One of the hardest challenges encountered in 
this study was to accurately extract the speech utterances from the audio. 
First, the mean of the digital signal was normalized to zero, so as to 
eliminate any bias introduced by the microphone. Second, the signal was 
passed through a high-pass filter with a passband of 100 Hz to 4000 Hz in 
order to remove equipment-related disturbances such as 50/60 Hz 
humming noise. Third, a search scheme based on short-time energy and 
zero-crossing rate was used to identify beginning and end of each 
utterance, as discussed earlier in Chapter II. A rectangular window was 
applied and the signal partitioned into 50% overlapping frames of 10 ms 
for short-term processing, as discussed in Chapter II. Note that a second 
detection algorithm using frame-based modified Teager’s energy was also 
considered [Ying, 1993]. However, the latter scheme produced 
comparable results to the first detection scheme at a higher computational 
load, and as a result, was not used in the final speech endpoint detection 
phase. Finally, the speech portion of each signal was cropped and stored 
based on the estimated boundaries. 

3. Feature extraction step: Feature extraction was performed to obtain the 
spectral and temporal characteristics of the speech waveform in a compact 
representation. First, the signal was divided into frames of 256 samples 
(corresponding to 32 ms) with an overlap of 100 samples (roughly 
corresponding to 40% overlapping) from frame to frame and a Hamming 
window applied to each frame. Two types of speech feature were 
considered: LPC-derived Cepstral Coefficients (LPC-CCs) and Mel- 
frequency Cepstral Coefficients (MFCCs). Results conducted on a small 
portion of the database showed classification results obtained with MFCC 
parameters were better than those obtained with the LPC-CC parameters, 
thus the preference was to use MFCC parameters as speech features for 
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the study. A filter bank of 24 triangular filters was used for the mel-scale 
eonversion in the MFCC eomputation. The first 12 MFCC (statie 
parameters) and 12 delta-MFCC (dynamie parameters) parameters were 
extraeted from eaeh frame of the eropped speeeh signal. Finally, the mean 
of the statie MFCC parameters was normalized to zero and the delta- 
MFCC parameters were sealed to the range of the statie MFCC parameters 
by multiplying them by a faetor of six. 

4. Vector quantization (VQ) step: Two thirds of the database was seleeted 
randomly to form the training set, while the reminder was used as the 
testing test. Feature vectors were generated from the training set and used 
to generate a codebook of length equal to 128 with the K-means 
clustering, as discussed in Chapter III. Note that a randomly generated 
codebook was also considered in an effort to significantly reduce the 
computational load by eliminating the K-means algorithm training phase. 
Kinnunen et al. reported that there is “no clusters” in the speech feature 
space based on the Euclidean distance metric and the cepstrum-based 
parameters (RCC, MFCC and LPCC) [Kinnunen, 2001]. This research 
study concluded that the role of vector quantization is only to remove 
redundancy from speech features and detecting clusters does not play a 
key role in codebook generation. Therefore, any fast method of choosing 
uniformly-spaced linear code vectors among the sample feature vectors 
should establish a codebook and work out for speech recognition. 
Motivated by the findings from [Kinnunen, 2001] a random codebook was 
considered using 256 symbols uniformly spaced in the range of speech 
features. However, experimental results showed that this linear codebook 
resulted in poor classification performances, and was not considered 
further. Note that our results did not agree with the findings in that earlier 
paper by [Kinnunen, 2001]. It is surmised that the differences could arise 
from the in-ear microphone data characteristics due to low-pass filtering in 
the ear canal. Both training and testing feature sets were vector-quantized 
and provided inputs to the DHMM classifier set-up, as explained next. 

5. Discrete-symbol Hidden Markov Model training step: A discrete-symbol 
HMM was selected with eight states for each word in the vocabulary. The 
training sequence of vector-quantized symbols was used to train each 
model and the model parameters that maximize the likelihood of the 
training set were estimated using the Baum-Welch re-estimation, as 
discussed in Chapter IV. Also applied was the Viterbi training algorithm 
for comparison and through experiments it was discovered that the Baum- 
Welch re-estimation algorithm resulted in better classification rates than 
those obtained with the Viterbi algorithm for the data considered. 

6 . Hidden Markov Model recognition step: Recall that endpoint detection, 
feature extraction and vector quantization were performed for each word 
in the testing set, as illustrated in Figure 5.1. Once model parameters are 
determined from the training set, the recognizer is designed to decide that 
unlabelled data is most likely to have come from a specific model M. by 
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selecting the model which maximizes the a-posteriori probability that the 
data was generated by the given model, i.e., 


arg max 




for i = 1,2,...,7. 


(5.1) 


That is, the recognizer identifies the model, which is most-likely, given the 
discrete observation sequence. 

Next, the overall HMM classification results obtained with the data under 
consideration are presented. 

B. DHMM RECOGNIZER CLASSIFICATION RESULTS 

Eighty experiments were conducted to investigate the overall classification results 
obtained with a DHMM classifier approach on the database, where two thirds of the data 
were randomly selected for the training phase and the reminder one third of the data were 
used for testing. Several models were considered by varying the frame length, the number 
of MFCC coefficients, and implementations with and without the delta-MFCC 
coefficients. The combination leading to the best average classification results is shown 


below; 

a. 

Frame length: 

256 samples. 

b. 

Increments: 

156 samples (40% overlapping frames). 

c. 

Speech features: 

12 MFCCs and 12 delta-MFCCs. 

d. 

Mel-scale filter bank: 

24 triangular filters. 

e. 

Delta-MFCC scaling factor: 

6 . 

f 

Codebook size: 

128 symbols. 

g- 

Model size of HMMs: 

8 states. 


First, the overall average classification results for the isolated words under 
consideration are discussed. Second, the experimental findings for the second dataset 
(Kill noise and Right outside) are presented to show the recognition performance in 
noisy environment conditions and the potential distortions due to the ear canal 
investigated. Finally, the timing issues regarding the simulation run-times were 
examined. 

I. Overall Recognition Results 

Results show the overall classification rate to be equal to 92.77%. Table 5.1 
shows the confusion matrix of the average classification results obtained on the testing 
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sets over 80 experiments. Figure 5.2 plots the average elassification rates obtained for 
each word. 


WORD 

AVERAGE( 

CLASSIFICATION RATES FI 

□R THE TESTING SET (%) 

Up 

Down 

Left 

Right 

Kill 

Pan 

Move 

Up 

92.5500 

3.2125 

2.3375 

0.4375 

0.3458 

0.8208 

0.2958 

Down 

0.3042 

92.3333 

1.7958 

0.3250 

2.3292 

2.3500 

0.5625 

Left 

2.1375 

1.6875 

87.6833 

7.0792 

0.6542 

0.3625 

0.3958 

Right 

0.4875 

1.0750 

2.6417 

94.6708 

0.4000 

0.3792 

0.3458 

Kill 

0.3125 

1.7250 

0.7917 

0.4667 

94.2292 

2.2500 

0.2250 

Pan 

0.0917 

3.5500 

0.9375 

0.3375 

1.8708 

91.9458 

1.2667 

Move 

0.6875 

0.1000 

1.4542 

1.2875 

0.2917 

0.2167 

95.9625 


_ Average Classification: 92.77 % _ 

Table 5.1. Confusion Matrix for the Average Classification Rates for Testing Sets; 80 
Experiments, 62.5% Training, 37.5% Testing. 


Average Classification Rates for the Testing Set 
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Up Down Left Right Kill Pan Move 

Vocabulary Words 


Figure 5.2. Average Classification Rates for Testing Sets; 80 Experiments, 62.5% 

Training, 37.5% Testing. 
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Table 5.2 lists the resulting 95% confidenee level intervals for the averaged 
classification results obtained for 80 experiments on testing sets. Figure 5.3 illustrates the 
95% confidence level intervals and the average classification rates for the testing set. 


WORD 

95% CONFIDENCE INTERVALS 
FOR TESTING SETS, in % 

Up 

[86.0000 - 96.0000] 

Down 

[86.0000-95.3333] 

Left 

[79.3333 -93.3333] 

Right 

[92.3333 - 97.0000] 

Kill 

[91.0000-97.3333] 

Pan 

[86.6667 - 96.0000] 

Move 

[93.0000 - 98.0000] 

Overall 

Classification 

[91.0952-94.2857] 


Table 5.2. 95% Confidence Level Intervals for Testing Sets; 80 Experiments. 


95% Confidence Intervals for the Testing Set 



Vocabulary Words 


Figure 5.3. 95% Confidence Intervals and the Average Classification Rates for 

Testing Sets; 80 Experiments. 
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Figure 5.2 shows that the miselassification rate (or the word error rate) of the 
word “Left” stands out as the worst among all words. In an effort to find the cause of the 
performance degradation for this word, the misclassilied “Left” utterances were traced 
back. We noted that miselassification errors occurred for trials where ending boundaries 
did not include the low energy “t” sound which was supposed to be present at the end of 
the word “Left.” As a result, we surmise these errors were mostly caused by the incorrect 
detection of the speech boundaries, due to the hard-to-detect unvoiced low-energy ending 
sound “t” in the word “Left.” 

2. Experimental Results for the Second Dataset 

Eighty experiments were also run with the second data set consisting of 1,000 
utterances of “Kill_noise” and 400 utterances of “Right outside.” Table 5.4 shows the 
average classification rates for the second data set while Table 5.5 lists the 95% 
confidence level intervals for the second data set. 


WORD 

AVERA 

1 

.GE CLASSIFICATION RATES FOR 

rHE SECOND DATA SET (%) 

Up 

Down 

Left 

Right 

Kill 

Pan 

Move 

Right_outside 

0.8747 

4.1858 

9.5897 

83.5146 

0.8969 

0.9192 

0.0191 

Kill_noise 

2.0082 

5.3058 

4.1641 

2.6784 

79.2431 

6.1595 

0.4409 


Table 5.4. Average Classification Rates for the Second Data Set; 80 Experiments. 


WORD 

95% CONFIDENCE INTERVALS 
FOR THE SECOND DATA SET 

Right_outside 

[63.1043 -94.6565] 

Kill_noise 

[72.6809 - 84.5056] 


Table 5.5. 95% Confidence Eevel Intervals for the Second Data Set; 80 Experiments. 

Eigure 5.4 illustrates the 95% confidence level intervals and the average 
classification rates for the second data set consisting of the words “Right outside” and 
“Kill noise”. Note that hidden Markov models were trained with the first data set of 
words {up, down, left, right, kill, pan, move} and tested on the second data set in these 
experiments. 
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Figure 5.4. 95% Confidence Intervals and the Average Classification Rates for the 

Second Data Set; 80 Experiments 

Recall that the isolated utterances of the words “Right outside” and “Kill noise” 
were not included in the training population which was used to estimate the model 
parameters. The following comments can be made. 

1. Average classification rates for the word “Kill” collected within the ear canal while 
the subject is placed in a severely distorted environment show the ear-canal and the 
foam casing noise limit recognition degradations over what would be observed from 
the same signal collected in front of the mouth. However, they also indicate that 
recognition performance still worsens, thereby showing the need for some noise 
canceling pre-processing to minimize these noise distortions prior to the recognition 
stage. 

2. Average classification rates obtained for the word “Right” collected outside the 
mouth shows significant performance degradation (down to 63%), over that 
observed with the same word collected from within the ear canal (down to 86%). 
Recall that the ear canal dampens high frequencies, as shown in Figure 2.1 which 
presents spectrograms of two trials of the word “Right” collected within the ear 
canal and outside the mouth, respectively. As a result, the HMM model derived for 
the data with dampened higher frequencies does not accurately fit the data collected 
outside the mouth, resulting in degraded recognition performances. 


100 




3. Timing Issues 

Table 5.3 shows the average training times per utterance of a word for the 
following four computational tasks which are part of the classifier set-up: 1) the K-means 
clustering step which transforms the continuous valued feature vectors into a codebook, 
2) the Baum-Welch re-estimation algorithm step which is applied to compute the HMM 
parameters, 3) the overall average training time, and 4) the average time needed to make 
a decision for a word from the testing set. The specifications shown are based on average 
MATLAB simulation run-times obtained with a PC with 3 GHz Intel Pentium IV 
processor and I GB of RAM. Table 5.3 shows that the K-means clustering algorithm step 
represents more than 50% of the overall training time needed to generate the codebook, 
while the computational time of the Baum-Welch re-estimation algorithm step needed to 
train one of the HMMs is about 44% less. Finally, we note that the recognition time on a 
per word basis is about 1/3 of that required at the training phase. 


SIMULATION TYPE 

AVERAGE MATLAB SIMULATION 
RUN-TIME (seconds) 

K-means algorithm training time 
per word: 

0.0771 

Baum-Welch algorithm training time 
per word: 

0.0431 

Overall training time per word: 

0.1202 

Recognition time per word: 

0.0428 


Table 5.3. Average Simulation Run-Times for 80 Experiments. 
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VI. CONCLUSIONS AND RECOMMENDATIONS 


This thesis study extended the earlier work by Newton [Newton, 2006] who 
eollected a database of isolated word utteranees of seven words from 20 adult subjects by 
using the in-ear microphone. The objective of this thesis was to study this database and 
demonstrate the feasibility of using acoustic speech signals collected within the external 
ear canal for isolated word recognition. The intent was also to investigate the 
performance and noise-robustness of the in-ear microphone data for speech recognition in 
noisy operational environments. These objectives were successfully accomplished by 
implementing a discrete-observation HMM-based isolated word recognizer in MATLAB. 
A. SIGNIFICANT RESULTS AND CONCLUSIONS 

The following specific conclusions were drawn through experimental results: 

• Three types of digital filters: a first-order pre-emphasis filter (FIR), a 
sixth-order bandpass filter (HR) and a sixth-order high-pass filter (HR) 
were implemented and applied to the database in the pre-processing step. 
The use of high-pass filter resulted in the best classification performances 
in the isolated word HMM recognizer. 

• Two different speech segmentation procedures for endpoint detection were 
studied. The first approach used the short-term energy (STE) and the zero¬ 
crossing rate (ZCR), while the second was based on the frame-based 
modified Teager energy. Results showed similar detection accuracies 
were obtained for both. The Teager energy approach was found to be 
computationally more expensive than the alternative, and as a result, it 
was not used in the detection/cropping step. However, we also noted the 
STE and ZCR detection approach still produced misdetection of word 
boundaries for some of the words with weak fricative sounds. 

• Two popular types of speech features, specifically LPC-derived Cepstral 
coefficients (LPC-CC) and Mel-frequency Cepstral coefficients (MFCC) 
were extracted from speech frames separately and their performances 
compared. The use of MFCC parameters for feature extraction resulted in 
better recognition rates. In addition, augmentation of the static MFCC 
parameters by the dynamic delta-MFCC parameters improved recognition 
rates. Finally, we also noted that Cepstral mean correction and delta 
parameters scaling had a significant impact for feature enhancement, 
improving the classification accuracy further. 

• K-means clustering was implemented for codebook generation and vector 
quantization of continuous valued speech features. We observed that 
increasing the codebook size diminished the quantization distortion. 
However, large codebook sizes (over 128 codewords) were not practical 
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for this study’s small-vocabulary recognition system, eonsidering the 
inereased training time assoeiated with a larger eodebook. We also 
investigated the use of a linear eodebook uniformly spaeed along the range 
of speech features, whieh resulted in poor clustering and elassification. It 
was observed that the K-means elustering algorithm with the seleetion of a 
suffieiently-large eodebook size ean be effeetively used to obtain diserete 
speech feature sequenees for training the diserete-observation HMMs. 

• The seleetion of proper HMM parameters sueh as the number of diserete 
symbols (or equivalently eodebook size) and the number of hidden states 
to set up the HMM plays a key role in reeognition performanee. Although 
there are only four phonemes per word in this study’s voeabulary on 
average, experimental results showed the state size of eight was required 
to model the speeeh eharacteristies of in-ear mierophone data for high 
elassifieation aeeuracy. 

The average elassifieation rate for this study’s small-voeabulary, multiple-speaker 
isolated word HMM recognition system was 92.77%. Results show that the acoustie 
signals originated from speech organs and collected within the external ear canal via an 
in-ear mierophone ean be effeetively used for isolated word reeognition. 

The seeond dataset eollected under low SNR eonditions with additive noise 
resulted in 79.24% recognition accuracy in the HMM-based elassifier. This result is 
eonsistent with the theoretical noise-shielding property of the human ear, thus justifying 
the use of in-ear mierophone data for speeeh enhancement and improved reeognition 
under low-SNR eonditions. 

Finally, we investigated using the in-ear mierophone outside the mouth and 
eomparing elassifieation results obtained for the data eolleeted within the ear canal and 
outside the mouth. Average elassification rates obtained for the data eolleeted outside the 
mouth shows signifieant performanee degradation (down to 63%), over that observed 
with the data eolleeted from within the ear eanal (down to 86%). Reeall that the ear eanal 
dampens high frequeneies. As a result, the HMM model derived for the data with 
dampened higher frequeneies does not aeeurately fit the data eolleeted outside the mouth, 
resulting in degraded recognition performances. 

These initial results are eonsidered highly eneouraging for the development of a 
human-maehine interfaee using the in-ear mierophone for the tele-operation of military 
robots in noisy operational environments. 
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B. RECOMMENDATIONS FOR FUTURE WORK 

Results indicate that there is still room for recognition rate improvements. Four 
main areas for further work are identified. 

First, the current database may be expanded to include additional speakers and the 
vocabulary size may be increased to include possible voice commands considering the 
development of a human-computer interface under consideration. In addition, a “clean” 
database needs to be collected inside a noise-controlled room to more formally 
investigate the impact of noise on the in-ear microphone data. Next, a set of standard 
noise databases with known SNR levels should be added synthetically to distort the data 
in order to investigate the performance degradation under known SNR levels. 

Second, it was observed that some word errors, especially for the word “Left” in 
the vocabulary, were due to the misdetection of speech endpoints. Further improvement 
should be possible with an improved speech segmentation algorithm development. A 
possible alternative segmentation algorithm may include entropy-based measures to 
locate the speech boundaries in noisy environments better. 

Third, speech features other than the MFCC and LPC-CC may be studied in an 
effort to determine the proper set of features for the in-ear microphone data. An 
alternative to the MFCC parameters is the use of Perceptual Linear Prediction (PLP) 
coefficients. Further, to augment the spectral speech features an energy term such as the 
log of frame-based speech energy can be appended to the speech parameters. The use of 
second order differential coefficients (delta-delta parameters) may also be investigated. 

Finally, performance improvement may be achieved by implementing different 
model structures. The current isolated word recognizer was a simple discrete-observation 
HMM set-up because discrete systems require low run-time computation. However, 
vector quantization reduces the accuracy by introducing some distortion. Therefore, a 
continuous-density HMM structure using Gaussian mixture densities could be selected 
for investigation. Another limitation of the current HMM set-up is due to the first-order 
Markov chain assumption which provides a temporal structure for modeling the time 
evolution of speech spectral characteristics from one state into the next by state 
transitions. Ideally, the observation vectors are assumed to be independent and identically 
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distributed (IID) within each state. The IID assumption suggests that there is no 
correlation between successive observation vectors (or speech feature vectors), which is 
not strictly true in terms of basic linguistic units. Therefore, a higher-order model 
structure with discrete observations could also be investigated. 

Alternatively, the incorporation of state duration densities into the HMMs, 
applying the error corrective training method or Bayesian Network adaptation might 
initiate a wide range of research topics and improve system performance. 
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APPENDIX A. SPEECH ERONT-END DETECTION MATLAB 

PROGRAMS 


This appendix includes the MATLAB programs used in the front-end detector of 

the isolated word recognition system. These programs determine the endpoints of speech 

boundaries and crop speech signals based on the detected endpoints. References show the 

specific endpoint detection algorithms implemented in the codes. 

1. SPEECH ENDPOINT DETECTION FROM THE SHORT-TERM 
ENERGY AND ZERO-CROSSING RATE 

% Filename : endpointl.m 

% Written by : R.S. Kurcan, LTJG. TU NAVY. 

% Date last revised : 10\18\2005. 

% Purpose : This function reads in a speech file and applies an endpoint 

% detection algorithm based on Short-term Energy and Zero-crossing Rate to determine 
% the speech boundaries. 

% 

function [datacropped,speechlength,speechflag] = endpointl(datadir,fs,win,fi,inc); 

% 

% Endpoint Detection Using Short-term Energy and Zero-crossing Rate. 

% <Inputs> 

% datadir 

% fs 

% win 

% 

% 

% fi 

% inc 

% <Outputs> 

% datacropped ; This returns the cropped speech signal based on endpoint 

% detection. 

% speechfiag ; If any speechfiag is set for warning any single one or 
% combination of the following problems might have occurred. 

% 1. Split data file may be corrupt if norm(s)<0.1 

% 2. Infinite loop in raw search algorithm for speech 

% boundaries if there is no signal detected that has 20 frames or longer 
% duration. 

% 3. Speech portion of the signal starting too early 

% might indicate a possible noise sequence in the beginning due to 
% bad recording. 

% All above cases need further investigation after cropping speech to determine if 
% the cropped data is corrupt. 

% speechlength : It is the length of the cropped speech signal. 

% 


: Directory of the txt file containing speech signal. 

: Sampling frequency in Hz (Default 8000). 

: Window type 'R' Rectangular window in time. 

"N' Hannning window in time. 

'H' Hamming window in time. 

:Erame length (Default 80 samples corresponding to 10 ms), 
ilncrement in num of samples (Default 40). 
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% <References> 

% [1] L. R. Rabiner and M. R. Sambur, “An algorithm for determining the endpoints 
% of isolated utteranees,” The Bell System Technical Journal, February 1975. 


% 


%■ 


ele, elear('s'); 

if nargin < 2 fs = 8000; end 

if nargin < 3 win = 'R'; end % Default Reetangular Window, 
if nargin < 4 fl = fs/100; end % Default frame length is 80 for 10 ms. 
if nargin < 5 ine = fl/2; end % Default inc.in overlapping window is 40. 
if win == 'R' 

% Default Reetangular Window of length 80 for 10 ms frames. 

w = reetwin(fl); 

else if win =='N' w = hann(fl); 

else if win =='H' w = hammmg(fl); end 
end 
end 

s = load(datadir); 

sm = s-mean(s); % Takes the mean out of data before proeessing. 

speeehflag = 0; 

% HR Elliptieal BPF Design 

%d = fdesign.bandpass(100/4000,150/4000,2000/4000,2400/4000,60,0.5,60); 
%hd = elhp(d); 

%sf = fdter(hd,sm); % Filters the speeeh data. 

% Design of a first-order pre-emphasis filter 
%a= 1; 

%b = [1,-15/16]; 

%sf = filter(b,a,sm); 

% HR Elliptieal HPE Design 

d = fdesign.highpass('n,fst,fp,ap', 10,60/4000,100/4000,0.5); 
hd = ellip(d); 
sf = filter(hd,sm); 

% Checking bad split trials below if any. 

if norm(sf) < 1.5; datacropped = []; speeehflag = 1; speechlength = 0; 
else 

% Truncating the data to make it divisible by frame length, 
m = fioor(length(sf)/mc); 
sf = sf(l:m*mc); 
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% Short-term energy per frame is STE and Zero crossing count 
% per frame is ZCR. % 50 overlaping frames taken by default, 
for n=l:m-l; 

sw = sf(inc*(n-l)+l;inc*(n-l)+fl).*w; STE(n) = sum(abs(sw)); 
for j= 2:fl; 

zc0)=abs(sign(sw0))-sign(sw0-l)))/2; 

end 

ZCR(n) = sum(zc); 
end 

% Assuming there is no speech during the first 100 ms of recordings, 

% Mean and standard deviation of ZCR and STE for the first ten frames, 
avgzcr = mean(ZCR( 1:10)); stdzcr = std(ZCR( 1:10)); 
avgste = mean(STE(l:10)); stdste = std(STE(l:10)); 

% Setting up STE Upper and Eower Thresholds and ZCR Threshold. 

IE = fi/4; % 20 crossings per 10 ms is a fixed value when N=80; 

IZCT = min(IE,avgzcr+stdzcr); % Zero-crossing Rate Threshold. 

IE = 0.25; % Upper level for avgste in case high noise 

% present in first 20 frames, 
minste = min(IE,avgste+stdste); 

ITE = 8*minste; % Eower threshold for STE. 

ITU = 32*minste; % Upper threshold for STE. 

ET = (ITU-ITE)/2 + ITE; % Eine threshold for STE. 

% Eollowing algorithm firsts makes a raw search for endpoint detection 
% based on STE, ITU and UTE. Then it refines the speech boundaries using 
% ZCR and IZCT for the successive and preceding 6 frames forth and back. 

% If the interval btwn rawstart and rawend is less than 100 ms (20 frames) 

% algorithm assumes that it is just a click noise (or a spike), not speech. 

duration = 0; rawend = 0; rawstart = 0; loopn =0; d = 1; refindex = 0; 
while duration <21; % Raw search for speech boundaries starts, 

for n = d:m-l 
ifSTE(n)>ITU; 

refindex = n; rawstart = n; 
for 1 = refindex:-1.0:1 

if STE(l) < ET; rawstart = 1; break; end 
end 
break; 

% If STE of speech signal does not exceed upper threshold, ITU is set to ITU = ITU/2 
% and makes one more search, 
else if n == m-1; 

ITU = ITU/2; ET = ET/2; 
for V = I:m-I 
ifSTE(v)>ITU; 
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refmdex = v; rawstart = v; 
for p = refindex:-! .0; 1 

if STE(p) < FT; rawstart = p; break; end 
end 

break; 

end 

end 

end 

end 

end 

if refindex 0; 
for k = refindex :m-l 

if STE(k) < FT; rawend = k; break; end 
end 
end 

duration = rawend - rawstart; % If duration < 20 frames, it keeps seanning. 
if (duration == 0); speeehfiag =2; break; end 
if (duration < 0); speeehfiag =3; break; end 
loopn = loopn+1; 

if loopn >10; speeehfiag = 4; speeehlength = 0; break; end % Prevents infinite loop, 
d = rawend+1; 
end 

% Fine search using total number of intervals crossing threshold of ZCR. 
finestart = rawstart; fineend = rawend; 

if (duration>23)&&(duration<50) 

nzc = 0; update = 0; 

if (rawstart-6) > 0; 

for n = rawstart:-! :(rawstart-6) 

if ZCR(n) > IZCT; nzc = nzc+1; update = n; end 
end 

if nzc > 3; finestart = update; end 
end 

nzc = 0; update = 0; 

if ((rawend+6) < length(ZCR))&&(rawend ~= 0); 
for n = rawend: (rawend+6) 

if ZCR(n) > IZCT; nzc = nzc+1; update = n; end 
end 

if nzc > 3; fineend = update; end 
end 
end 

starti = inc*(finestart)+l; 
endi = inc*(fmeend); 
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if (endi-starti) > 878; 
speechlength = endi-starti; 
datacropped = sf(starti:endi); 
speechflag = 0; 

else speeehlength = 0; dataeropped = sf; 
end 

% Plots of STE, ZCR, and original speeeh signal, thresholds and resulting 
% endpoints on the speeeh signal. 

%figure(l); 

%subplot(3,l,l); plot(STE); hold on, 

%subplot(3,l,l), plot(l;length(STE),ITEl,'r'), hold on; 

%subplot(3,1,1), plot( 1 ;length(STE),ITE,'g'); 

%title('Short-term Energy'); ylabel('Magnitude'); xlabel('Erame Number'); 

%subplot(3,l,2); plot(ZCR); hold on; 

%subplot(3,l,2); plot(l;length(ZCR),IZCT,'r'); 

%title('Zero-orossing Rate'); ylabel('Zero-erossings'); 

%xlabel('Erame Number'); 

%subplot(3,l,3); plot(s); hold on; 

%subplot(3,l,3); plot(starti,-2:0.1;2,'r'); hold on; 

%subplot(3,1,3); plot(endi,-2:0.1:2,'r'); 

%title(datadir); ylabel('Magnitude'); 

%xlabel('Sample Number'); 

%figure(2); 

%subplot(2,l,l); plot(s(starti:endi)); title('Original Speeeh Cropped'); 
%subplot(2,l,2); plot(sf(starti:endi)); title(datadir); % filtered speeeh eropped. 

end 


% - 

% End of funetion endpointl.m 
%- 


2. SPEECH ENDPOINT DETECTION FROM THE FRAME-BASED 
MODIFIED TEAGER ENERGY 


% 

% 

% 

% 

% 


Eilename 
Written by 
Date last revised 
Purpose 


endpoint2.m 

R.S. Kurean, ETJG. TUNAVY. 

11\08\2005. 

This funetion reads in a speeeh file and applies an endpoint 
deteetion algorithm based on the frame-based modified Teager’s energy to determine 
% the speeeh boundaries. 

% 

funetion [dataoropped,speeohlength,speeohflag] = endpoint2(datadir,fs,win,fl,ino); 
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% Endpoint Detection Using Frame-based Modified Teager’s Energy. 

% <Inputs> 

:Directory of the txt file containing speech signal. 
:Sampling frequency in Hz (Default 8000). 
iWindow type 'R' Rectangular window in time. 

"N' Hannning window in time. 

'H' Hamming window in time. 

:Frame length (Default 80 samples corresponding to 10 ms), 
increment in num of samples (Default 40). 


% 

% 

% 

% 

% 

% 

% 


datadir 

fs 

win 


fl 


fit returns the cropped speech signal based on endpoint 


me 

% <Outputs> 

% datacropped 
% detection. 

% speechflag :If speechflag is set to 1 it returns a warning regarding 
% any single one or combination of the following problems. 

% 1. Split data file may be corrupt if norm(s)<3 

% 2. Infinite loop in raw search algorithm for speech 

% boundaries if there is no signal detected that has 20 frames or longer 
% duration. 


% 3. Speech portion of the signal starting too early 

% might indicate a possible noise sequence in the beginning due to 
% bad recording. 

% All above cases need further investigation after cropping to make sure 
% the cropped data is not corrupt. 

% <References> 

% [1] G. S. Ying, C. D. Mitchell, E. H. Jamieson, “Endpoint detection of isolated 
% utterances based on a modified Teager energy measurement,” Proceedings of 1993 
% IEEE International Conference on Acoustics, Speech, and Signal Processing 
% (ICASSP-93), Vol.2, pp.732-735, April 1993. 


%■ 


if nargin < 2 fs = 8000; end 

if nargin < 3 win = 'R'; end % Default Rectangular Window, 
if nargin < 4 fl = fs/50; end % Default frame length is 160 for 20 ms. 
if nargin < 5 inc = fl/2; end % Default inc.in overlapping window is 80. 
if win == 'R' 

% Default Rectangular Window of length 160 for 20 ms frames. 

w = rectwin(fl); 

else if win =='N' w = hann(fl); 

else if win =='H' w = hamming(fi); end 
end 
end 

s = load( datadir); 

sm = s-mean(s); % Takes the mean out of data before processing, 

speechflag = 0; 

if norm(sm) <1.5; datacropped = []; speechflag =1; % Checking bad split trials if any. 
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else 


% HR Elliptieal BPF Design 

%d = fdesign.bandpass(100/4000,150/4000,2000/4000,2400/4000,60,0.5,60); 

%hd = ellip(d); sf = fdter(hd,sm); % Filters the speeeh data. 

% First-order pre-emphasis filter 
%a= fib = [1,-15/16]; 

%sf = fdter(b,a,sm); 

% HR Flliptieal HPF filter 

d = fdesign.highpass('n,fst,fp,ap', 6, 100/4000, 150/4000, 0.5); 
hd = ellip(d); sf = filter(hd,sm); 

% Frame-based Teager Energy Measure 
% 1. Caleulate the power speetrum 

% 2. Weight eaeh sample in the power speetrum with the square of the frequeney 
% 3. Take the square root of the sum of the weighted power speetrum. 

% Zero padding signal if neeessary before % 50 overlapping windowing, 
ml = (length(sf)/mo); 

ml = oeil(ml); sf(length(sf):mo*ml) = zeros; 
end 

% Frame-based Teager's Energy Measurement, FTE 
f=fs*(0;64)/128; 

Wf = f.^2; % Weighting faetor for power speetrum of eaeh frame, 

for n=fiml-fi 

sw = sf(80*(n-l)+fi80*(n-l)+160).*w; 
swfft = fft(sw,128); 

Ps = swfft. *oonj (swfft) /128; 

WPs = transpose(Ps(fi65)).*Wf; % Power Speet. weighted by the square of the fireq. 
FTE(n) = sqrt(sum(WPs)); 
end 

% Endpoint deteetion using Teagar's Frame-based Modified Energy, 
maxft = mean(FTE( 1; 10))+std(FTE( 1; 10)); 
maxfte = mm( 12,maxft); 

ITFT = 6*maxfte; 

ITFIT = 18*maxfte; 

dur = 0; rend = 0; rstart = 0; dt = 1; loopn = 0; 
while dur < 20; 
for n = dt:(ml-l) 
if FTE(n) > ITU T; 
rindex = n; 
for 1 = rindex:-1.0:1 
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if FTE(l) < ITL T; rstart = 1; break; end 
end 
break 
end 
end 

for k = rindex:ml-l 

if FTE(k) < ITE T; rend = k; break; end 
end 

dt = rend+1; 
dur = rend - rstart; 
loopn = loopn +1; 

if loopn > 5; speeehflag =0; speechlength =0; break; end 
end 

startt = 80*rstart; 
endt = 80 * (rend-1); 
speeehlength = startt-endt; 
dataeropped = sf( startt: endt); 

% Plots of ste, zcr, EXE, original speeeh signal, thresholds and resulting 
% endpoints on the speeeh signal. 
figure(l); subplot(2,l,l); plot(ETE); hold on; 
subplot(2,l,l), plot(l:length(ETE),ITEl_T,'r'), hold on; 
subplot(2,l,l), plot(l:length(ETE),ITE_T,'g'); 

title('Erame-based Teagers Energy'); ylabel('Magnitude'); xlabel('Erame Number'); 
subplot(2,l,2); plot(s); hold on; 
subplot(2,l,2); plot(startt,-2:0.1:2,'g'); hold on; 
subplot(2,l ,2); plot(endt,-2:0.1:2,'g'); 

title('MOVE'); ylabel('Magnitude'); xlabel('Sample Number'); 
end 

% - 

% End of funetion endpoint2.m 

% - 


3. SPEECH CROPPING BASED ON ENDPOINT DETECTION 

% Filename : eropspeeeh.m 

% Written by : R.S. Kurcan, ETJG. TU NAVY. 

% Date last revised : 11\15\2005. 

% Purpose : This funetion erops the individual speeeh fdes 

% in the the “Split” folder and saves them in the “Cropped” folder based 
% on an endpoint detection algorithm. 

% <Description> 

% 

% This function crops the split speech files located in the user-specified 
% 'Data' directory and saves them in the subfolder, “Cropped.” 

% User specifies the directory of the folder where the speech data is 
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% stored. This routine first lists the personal folders, then it finds the 
% split folder under eaeh person and then lists the folders of 7 words. 
% Afterward, the algorithm calls for the endpoint.m function for each 
% word (txt file). Cropped data file is saved under the folder 'Cropped' 
% unless it is corrupt. If it is corrupt it is saved under folder 'Corrupt' 

% - 


w = cd; 

user entry = input('Specify the directory of the folder for the speech data:\n','s'); 

indfolder = dir(user_entry); 

max length = 3000; min length = 2500; 

for n = 3:length(ind_folder) 
parentdir = strcat(user_entry,'\',ind_folder(n).name); 
cd(parentdir); 

mkdir('Cropped'); mkdir('Corrupf); 
c d(strc at(parentdir, '\Cropped')); 

mkdir('down'); mkdir('kiir); mkdir('kill_noise'); mkdir('leff); 
mkdir('move'); mkdir('pan'); mkdir('righf); mkdir('right_outside'); 
mkdir('up'); 

cpath = strcat(parentdir,'\Splif); 

cd(cpath); 

splitfolder = dir; 

flags = []; flags.stat = []; fiags.name = []; g = 1; 

for m = 3;length(spht_folder) 

wpath = strcat(cpath,'\',split_folder(m).name); 

cd(wpath); 

wordfolder = dir; 

cd(w); 

%zlen = []; p = 1; 

for k = 3;length(word_folder) 

datapath = strcat(wpath,'\',word_folder(k).name); 

[datacropped,speechlength,speechfiag] = endpointl (datapath); % Calls the 
% endpoint detection function. 

fprintf(l,'... Cropping file: %s and length, %6.2fn',datapath, speechlength); 

if (speechflag == 0) % If cropped data is not a bad one. 
if speechlength > max length; 
maxlength = speechlength; 
maxfile = word_folder(k).name; 
else if (speechlength < min_length)&&(speechlength ~= 0); 
minlength = speechlength; 
minfile = word_folder(k).name; 
end 
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end 

savepath =strcat(parentdir,'\CroppedV,split_folder(m).name,... 

'\' ,word_folder(k). name); 
save(savepath, 'datacroppedV-ASCII'); 

%zlen(p) = speeehlength; 

%p = p+1; 

else % If cropped data is a bad trial. 

savepath =strcat(parentdir,'\Corrupt\',word_folder(k).name); 
save(savepath, 'datacroppedV-ASCII'); 
flags(g).name = word_folder(k).name; 
flags(g).stat = speechflag; 
g = g+i; 
end 

clear('datacroppedVspeechlengthVspeechflag'); 

end 

%savepath =strcat(parentdir,'\CroppedV,split_folder(m).name,'\zlen.txt'); 
%save(savepath, 'zlenV-ASCII'); clear('zlen'); 

end 

savepath =strcat(parentdir,'\Corrupt'); 
cd(savepath); 
save('flagstat.mat', ’flags'); 
clear('flags'); 
end 

maxfile 

maxlength 

minfile 

minlength 

% - 

% End of function endpoint!.m 

% - 
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APPENDIX B. SPEECH FEATURE EXTRACTION AND VECTOR 
QUANTIZATION MATLAB PROGRAMS 


This appendix includes the MATLAB programs used for feature extraction 
(MFCC and LPC-CC computation), vector quantization and codebook generation. All the 
MATLAB codes originated from those available at [Sorensen, 2005] and [Brookes, 
1997]. Either the original codes were used or they were modified and the source of code 
cited in the header of each program. 

1. MEL-FREQUENCY CEPSTRAL COEFFICIENT (MFCC) 
COMPUTATION 

% Filename ; melcepst.m 

% Written by : Mike Brookes (VOICEBOX). 

% Date last revised ; 02/21/2005. 

% Modified by : R.S. Kurcan, ETJG. TU NAVY. 

% Date last modified ; 10/25/2005. 

% Purpose ; This routine computes the Mel-cepstrum Cepstral 

% Coefficients (MFCC) of a speech signal on a frame-by-frame basis. This 
% routine calls the following subroutines: enframe.m, melbankm.m, 

% rfft.m and rdct.m which were also written by Mike Brookes and appended 
% to the end of the routine. 

% 

function c = melcepst(s,fs,w,nc,p,n,inc) 

% 

% Simple use: c=melcepst(s,fs) % calculate mel cepstrum with 12 coefs, 

% 256 sample frames 

% c=melcepst(s,fs,'e0dD') % include log energy, 0th cepstral coef, 

% delta and delta-delta coefs 
% 

% <Inputs> 

% s : speech signal 

% fs :sample rate in Hz (default 11025) 

% nc :number of cepstral coefficients excluding O'th coef. (default 12) 

% n : length of frame (default power of 2 <30 ms)) 

% p :number of filters in filterbank (default floor(3*log(fs)) ) 

% inc : frame increment (default n/2) 

% 

% w :any sensible combination of the following: 

% 

% 'R' : rectangular window in time domain 

% 'N' : Hanning window in time domain 

% 'M' :Hamming window in time domain (default) 

% 
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% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 


't' :triangular shaped filters in mel domain (default) 

'n' :hanning shaped filters in mel domain 
'm' :hamming shaped filters in mel domain 

'p' :filters aet in the power domain 

'a' : filters aet in the absolute magnitude domain (default) 

'O' finelude O'th order eepstral eoeffieient 

'e' finelude log energy 

'd' finelude delta eoeffieients (de/dt) 

'D' finelude delta-delta eoeffieients (d^2e/dt^2) 

'z' :highest and lowest filters taper down to zero (default) 

'y' : lowest filter remains at 1 down to 0 frequeney and 

highest filter remains at 1 up to nyquist freqeney 

If 'ty' or 'ny' is speeified, the total power in the fft is preserved. 


% <Output> e :mel eepstrum output: one frame per row 
% 


% Copyright (C) Mike Brookes 1997 
% 


% 

% 

% 


VOICEBOX is a MATLAB toolbox for speeeh proeessing. Home page is at 
http://www.ee.ie.ae.uk/hp/staff/dmb/voieebox/voieebox.html 


format long; 

if nargm<2 fs=8000; end 

if nargin<3 w='M'; end 

if nargm<4 ne=12; end 

if nargm<5 p=floor(3*log(fs)); end 

if nargm<6 n=pow2(floor(log2(0.03*fs))); end 

if nargm<7 me=floor(n/2); end 

fh=0.5; % fh high end of highest filter as a fraetion of fs. 

fl=0; % fl low end of the lowest filter as a fraetion of fs. 

if length(w)==0; w='M'; end 
if any(w=='R') 
z=enframe(s,nfine); 
elseif any (w=='N') 
z=enframe(s,hannmg(n),me); 
else 

z=enframe(s,hamming(n),ine); % Hamming window in time domain (default), 
end 

f=fft(z.'); 
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[m,a,b]=melbankm(p,n,fs,fl,fh,w); 
pw=f(a;b,;).*conj'(f(a;b,;)); 
pth=max(pw(:))* lE-6; 

if any(w=='p') % filters act in the power domain. 

y=log(max(m*pw,pth)); 

else % filters act in the absolute magnitude domain (default). 

ath=sqrt(pth); 

y=log(max(m*abs(f(a:b,:)),ath)); 

end 

c=dct(y).'; 
nf=size(c,l); 
nc=nc+l; 
if p>nc 

c(:,nc+l:end)=[]; 
elseif p<nc 
c=[c zeros(nf,nc-p)]; 
end 

if ~any(w=='0') % It is better not to include 0th order mel cepstrum 

% coefficient due to its more pronounced high amplitude. 
c(:,!)=[]; 
nc=nc-l; 
end 

if any(w=='e') % include log energy. 

c=[log(sum(pw)).' c]; 
nc=nc+l; 
end 

% calculate derivative 


if any(w=='D') 
vf=(4;-l;-4)/60; 
af=(l:-l:-l)/2; 
ww=ones(5,l); 
cx=[c(ww,:); c; c(nPww,:)]; 
vx=reshape(filter(vf, 1 ,cx(:)),nf+10,nc); 
vx(l;8,:)=[]; 

ax=reshape(filter(af, 1 ,vx(:)),nf+2,nc); 
ax(l:2,:)=[]; 
vx([l nf+2],:)=[]; 
if any(w=='d') 
c=[c vx ax]; 
else 

c=[c ax]; 
end 

elseif any(w=='d') 
vf=(4:-l;-4)/60; 
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ww=ones(4,l); 

cx=[c(ww,:); c; c(nPww,:)]; 

vx=reshape(filter(vf,l,cx(:)),nf+8,nc); 

vx(l;8,:)=[]; 

c=[c vx]; 

end 

if nargouKl 
[nf,nc]=size(c); 
t=((0 :nf-1 )* inc+(n-1 )/2)/fs; 
ci=( 1 :nc)-any(w=='0')-any(w=='e'); 
imh = imagesc(t,ci,c.'); 
axis('xy'); 
xlabel('Time (s)'); 
ylabel('Mel-cepstrum coefficient'); 
map = (0;63)’/63; 
colormap(jet(64)); 

% colormap([map map map]); 
colorbar; 

end 


% - 

% End of routine melcepst.m 

% - 

% SUBROUTINE 
function f=enframe(x,win,inc) 

% ENFRAME split signal up into (overlapping) frames: one per row. F=(X,WIN,INC) 
% 

% F = ENFRAME(X,EEN) splits the vector X up into 
% frames. Each frame is of length EEN and occupies 
% one row of the output matrix. The last few frames of X 
% will be ignored if its length is not divisible by EEN. 

% It is an error if X is shorter than EEN. 

% 

% F = ENFRAME(X,EEN,INC) has frames beginning at increments of INC 
% The centre of frame I is X((I-l)*INC+(EEN+l)/2) for 1=1,2,... 

% The number of frames is fix((length(X)-EEN+INC)/INC) 

% 

% F = ENFRAME(X,WINDOW) or ENFRAME(X,WINDOW,INC) multiplies 
% each frame by WINDOW(:) 

% Copyright (C) Mike Brookes 1997 
% Version: enframe.m,v 1.3 2005/02/21 15:22:12 
% 

% VOICEBOX is a MATLAB toolbox for speech processing. 
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% Home page; http://www.ee.ie.ae.uk/hp/staff/dmb/voieebox/voicebox.html 
% 


nx=length(x); 
nwin=length(win); 
if (nwin == 1) 
len = win; 
else 

len = nwin; 
end 

if (nargin < 3) 
inc = len; 
end 

nf = fix((nx-len+inc)/inc); 
f=zeros(nf,len); 
indf= inc*(0:(nf-l)).'; 
inds = (l;len); 

f(:) = x(indf(:,ones(l,len))+inds(ones(nf,l),:)); 
if (nwin >1) 
w = win(:)'; 
f = f .* w(ones(nf,l),:); 
end 


% - 

% End of subroutine enframe.m 

% - 

% SUBROUTINE 

function [x,mn,mx]=melbankm(p,n,fs,fl,fh,w) 

% MELBANKM determine matrix for a mel-spaced filterbank 
% [X,MN,MX]=(P,N,ES,EE,EH,W) 

% 

% <Inputs> 


% 

p 


mumber of filters in filterbank 

% 

n 


: length of fft 

% 

fs 


: sample rate in Hz 

% 

fl 


:low end of the lowest filter as a fraction of fs (default = 0) 

% 

fh 


: high end of highest filter as a fraction of fs (default = 0.5) 

% 

w 


:any sensible combination of the following: 

% 


'f 

:triangular shaped filters in mel domain (default) 

% 


'n' 

:hanning shaped filters in mel domain 

% 

% 

% 


'm' 

:hamming shaped filters in mel domain 


'z' 

:highest and lowest filters taper down to zero (default) 

% 


y 

: lowest filter remains at 1 down to 0 frequency and 
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% highest filter remains at 1 up to nyquist freqeney 

% 

% If 'ty' or 'ny' is speeified, the total power in the fft is preserved. 

% 

% <Outputs> 

% X :a sparse matrix eontaining the filterbank amplitudes 

% If X is the only output argument then size(x)=[p,l+floor(n/2)] 

% otherwise size(x)=[p,mx-mn+l] 

% mn :the lowest fft bin with a non-zero eoeffieient 

% mx :the highest fft bin with a non-zero eoeffieient 

% 

% <Usage> 

% f=fft(s); f=fft(s); 

% x=melbankm(p,n,fs); [x,na,nb]=melbankm(p,n,fs); 

% n2=l+fioor(n/2); z=log(x*(f(na:nb)).*eonj(f(na;nb))); 

% z=log(x*abs(f(l:n2)).^2); 

% e=det(z); e( !)=[]; 

% 

% To plot filterbanks e.g. plot(melbankm(20,256,8000)') 

% 

% Copyright (C) Mike Brookes 1997 
% Version: melbankm.m,v 1.3 2005/02/21 15:22:13 
% 

% VOICEBOX is a MATLAB toolbox for speeeh proeessing. 

% Home page: http://www.ee.ie.ae.uk/hp/staff/dmb/voieebox/voieebox.html 
% 


if nargin < 6 
w='tz'; 
if nargin < 5 
fh=0.5; 
if nargin < 4 
fi=0; 
end 
end 
end 

fl)=700/fs; 
fn2=fioor(n/2); 
lr=log((fl)+fh)/(fO+fi))/(p+1); 

% eonvert to fft bin numbers with 0 for DC term 

bl=n*((f0+fi)*exp([0 1 p p+l]*lr)-fO); 

b2=eeil(bl(2)); 

b3=floor(bl(3)); 

if any(w=='y') 

pf=log((f0+(b2:b3)/n)/(f0+fl))/lr; 
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fp=floor(pf); 

r=[ones(l,b2) fp fp+1 p*ones(l,fn2-b3)]; 
c=[l;b3+l b2+l:fn2+l]; 

v=2*[0.5 ones(l,b2-l) 1-pf+fppf-fp ones(l,fn2-b3-l) 0.5]; 
mn=l; 
mx=fn2+l; 
else 

bl=floor(bl(l))+l; 
b4=min(fn2,eeil(bl(4)))-1; 
pf=log((fD+(bl :b4)/n)/(fD+fl))/lr; 
fp=floor(pf); 
pm=pf-fp; 
k2=b2-bl+l; 
k3=b3-bl+l; 
k4=b4-bl+l; 
r=[fp(k2:k4) l+fp(l:k3)]; 
e=[k2:k4 l:k3]; 
v=2*[l-pm(k2:k4) pm(l:k3)]; 
mn=bl+l; 
mx=b4+l; 
end 

if any(w=='n') 
v=l-eos(v*pi/2); 
elseif any(w=='m') 
v=l-0.92/1.08*eos(v*pi/2); 
end 

if nargout > 1 
x=sparse(r,e,v); 
else 

x=sparse(r,e+mn-l ,v,p, l+fn2); 
end 

% - 

% End of subroutine melbankm.m 
%- 

% SUBROUTINE 
funetion y=rfft(x,n,d) 

% REET EET of real data Y=(X,N) 

% Data is truneated/padded to length N if speeified. 

% N even; (N+2)/2 points are returned with 
% the first and last being real 

% N odd; (N+l)/2 points are returned with the 
% first being real 

% In all eases fix(l+N/2) points are returned 
% 
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% Copyright (C) Mike Brookes 1998 
% Version: rfft.m,v 1.3 2005/02/21 15:22:14 
% 

% VOICEBOX is a MATLAB toolbox for speech processing. 

% Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 
% 


s=size(x); 
if prod(s)==l 
y=x 
else 

if nargin <3 
d=fmd(s>l); 
d=d(l); 
if nargin<2 
n=s(d); 
end 
end 

if isempty(n) 
n=s(d); 
end 

y=fft(x,n,d); 

y=reshape(y,prod(s(l :d-l)),n,prod(s(d+l :end))); 
s(d)=l+fix(n/2); 
y(:,s(d)+l:end,:)=[]; 
y=reshape(y,s); 
end 


% - 

% End of subroutine rfft.m 

% - 

% SUBROUTINE 
function y=rdct(x,n,a,b) 

%RDCT Discrete cosine transform of real data Y=(X,N,A,B) 

% Data is truncated/padded to length N. 

% 

% This routine is equivalent to multiplying by the matrix 
% 

% rdct(eye(n)) = diag([sqrt(2)*B/A repmat(2/A,l,n-l)]) * cos((0:n-l)'*(0.5:n)*pi/n) 

% 

% Default values of the scaling factors are A=sqrt(2N) and B=1 which results in an 
% orthogonal matrix. Other common values are A=1 or N and/or B=1 or sqrt(2). 

% If b~=l then the columns are no longer orthogonal. 

% 
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% Copyright (C) Mike Brookes 1998 
% Version; rdot.m,v 1.4 2005/02/21 15:22:14 dmb 
% 

% VOICEBOX is a MATLAB toolbox for speeeh proeessing. 

% Home page; http;//www.ee.ie.ae.uk/hp/staff/dmb/voieebox/voieebox.html 
% 


fl=size(x,l)==l; 
if fl x=x(:); end 
[m,k]=size(x); 
if nargin<2 n=m; 
end 

if nargin<4 b=l; 

if nargin<3 a=sqrt(2*n); 

end 

end 

if n>m x=[x; zeros(n-m,k)]; 
elseif n<m x(n+l:m, ;)=[]; 
end 

x=[x(l:2:n,:); x(2*fix(n/2):-2:2,:)]; 
z=[sqrt(2) 2*exp((-0.5i*pi/n)*(l:n-l))].'; 
y=real(fft(x). * z(:, ones( 1 ,k)))/a; 
y(l,:)=y(l,:)=^b; 
if fl y=y.'; end 


% - 

% End of subroutine rdet.m 

% - 

2. LPC-DERIVED CEPSTRAL COEFFICIENT (LPC-CC) COMPUTATION 

% Eilename ; hmmfeatures.m 

% Written by : Peter S.K. Hansen, IMM, Teehnical University of Denmark. 

% Date last revised ; 09/30/2000. 

% Purpose ; This routine eomputes the EPC-derived Cepstral Coeffieients 

% (EPC-CC) for feature extraetion. This routine ealls the subroutine durbin.m which is 
% also written by Peter S.K. Hansen so as to implement the Levinson-Durbin recursion. 
% This subroutine is appended to the end of the routine. 

function y = hmmfeatures(s,N,deltaN,M,Q) 

% hmmfeatures —> Eeature extraction for HMM recognizer. 

% 

% <Synopsis> 

% y = hmmfeatures(s,N,deltaN,M,Q) 

% 
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% <Description> 

% A frame based analysis of the speeeh signal, s, is performed to 
% give observation veetors (eolumns of y), whieh ean be used to train 
% HMMs for speeeh reeognition. 

% 

% The speeeh signal is bloeked into frames of N samples, and 
% eonseeutive frames are spaeed deltaN samples apart. Eaeh frame is 
% multiplied by an N-sample Hamming window, and Mth-order LP analysis 
% is performed. The LPC eoeffieients are then eonverted to Q eepstral 
% eoeffieients, whieh are weighted by a raised sine window. The result 
% is the first half of an observation veetor, the seeond half is the 
% differeneed eepstral eoeffieients used to add dynamie information. 

% Thus, the returned argument y is an 2Q-by-T matrix, where T is the 
% number of frames. 

% 

% <Referenees> 

% [1] J.R Deller, J.G. Proakis and F.H.L. Hansen, “Diserete-Time 
% Proeessing of Speeeh Signals”, IEEE Press, ehapter 12, (2000). 

% 

% - 


Ns = length(s); % Signal length. 

T = 1 + fix((Ns-N)/deltaN); % No. of frames. 

a =zeros(Q,l); 
gamma =zeros(Q,l); 
gamma_w = zeros(Q,T); 

win_gamma = 1 + (Q/2)*sm(pi/Q*(l;Q)'); % Cepstral window funetion. 

for (t = 1 :T) % Eoop frames. 

% Bloek into frames. 

idx = (deltaN*(t-l)+l):(deltaN*(t-l)+N); 

% Window frame. 

sw = s(idx).*hammmg(N); 

% Short-term autoeorrelation. 

[rs,eta] = xeorr(sw,M,'biased'); 

% EP analysis based on Eevinson-Durbin reeursion. 

[a(l;M),xi,kappa] = durbm(rs(M+l:2*M+l),M); % EP analysis. 


% Cepstral eoeffieients. 
gamma(l) = a(l); 
for (i = 2:Q) 
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gamma(i) = a(i) + (l:i-l)*(gamma(l;i-l).*a(i-l;-l;l))/i; 
end 

% Weighted cepstral sequenee for frame t. 
gamma_w(:,t) = gamma. *win_gamma; 
end 

% Time differenced weighted cepstral sequence, 
deltagammaw = gradient(gamma_w); 

% Observation vectors, 
y = [gamma w; delta_gamma_w]; 


% - 

% End of routine hmmfeatures.m 

% - 

% SUBROUTINE 

function [a,xi,kappa] = durbin(r,M) 

% durbin —> Eevinson-Durbin Recursion. 

% 

% <Synopsis> 

% [a,xi,kappa] = durbin(r,M) 

% 

% <Description> 

% The function solves the Toeplitz system of equations 
% 

% [ r(l) r(2) ... r(M) ] [ a(l) ] = [ r(2) ] 

% [ r(2) r(l) ...r(M-l)][ a(2) ] = [ r(3) ] 

% [ . . •][.]=[.] 

% [ r(M-l) r(M-2) ... r(2) ] [ a(M-l) ] = [ r(M) ] 

% [ r(M) r(M-l) ... r(l) ] [ a(M) ] = [ r(M+l) ] 

% 

% (also known as the Yule-Walker AR equations) using the Levinson- 
% Durbin recursion. Input r is a vector of autocorrelation 
% coefficients with lag 0 as the first element. M is the order of 
% the recursion. 

% 

% The output arguments are the M estimated EP parameters in the 
% column vector a, i.e., the AR coefficients are given by [1; -a]. 

% The prediction error energies for the Oth-order to the Mth-order 
% solution are returned in the vector xi, and the M estimated 
% reflection coefficients in the vector kappa. 

% 

% Since kappa is computed internally while computing the AR coefficients, 
% then returning kappa simultaneously is more efficient than converting 
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% vector a to kappa afterwards. 

% 

% <References> 

% [1] J.R Deller, J.G. Proakis and F.H.L. Hansen, “Discrete-Time 
% Processing of Speech Signals”, IEEE Press, p. 300, (2000). 

% 

% <Revision> 

% Peter S.K. Hansen, IMM, Technical Elniversity of Denmark 
% 

% East revised: September 30, 2000 
%- 


% Initialization, 
kappa = zeros(M,l); 
a = zeros(M,l); 
xi = [r(l); zeros(M,l)]; 

% Recursion, 
for (j=l:M) 

kappaO) = (rO+l) - a(l:j-l)'*r0:-l:2))/xi0); 
aO) = kappaO); 

a(l:j-l) = a(l:j-l) - kappa(j)*a(j-l:-l:l); 
xiO+1) = xi0)*(l - kappa0)^2); 
end 


% - 

% End of subroutine durbin.m 
%- 


3. K-MEANS CLUSTERING ALGORITHM 

% Eilename : k means.m 

% Written by : Peter S.K. Hansen, IMM, Technical University of Denmark. 

% Date last revised : 09/30/2000. 

% Purpose : This routine implements the K-means clustering algorithm for 

% the speech feature vectors. 

function [Yc,c,errlog] = k_means(Y,K,maxiter) 

% kmeans —> Train s a k-means cluster model. 

% 

% <Synopsis> 

% [Yc,c,errlog] = k_means(Y,K,maxiter) 

% 

% <Description> 

% The function uses the k-means algorithm to set the centroids of 
% a cluster model. The matrix Y represents the data which is being 
% clustered, with each row corresponding to a vector, and K is the 
% desired number of clusters. The cluster centroids are returned in 
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% the matrix Yc (rows), and the cluster number for each data vector 
% are returned in the vector c. The sum of squares error function is 
% used in the algorithm, and a log of the error values after each 
% iteration is returned in errlog. The maximum number of iterations 
% is specified by maxiter. 

% 

% <References> 

% [1] J.R Deller, J.G. Proakis and F.H.L. Hansen, “Discrete-Time 
% Processing of Speech Signals”, IEEE Press, p. 71, (2000). 

% - 


[M,N] = size(Y); % Number of vectors and vector dim. 

if(K>M) 

error('More centroids than data vectors.') 
end 

errlog = zeros(maxiter,l); % Eog of error value after each iteration. 

% Initialize centroids Yc by random selection of K vectors in Y. 
perm = randperm(M); 

Yc = Y(perm(l:K),:); 

% Constant term in squared Euclidean distance between rows in Y and Yc. 
d2y = (ones(K,l)*sum((Y.^2)'))'; 

for (i = 1 :maxiter) 

% Save old centroids to check for termination. 

Ycold = Yc; 

% Squared Euclidean distance (M-by-K matrix) between rows in Y and Yc. 
d2 = d2y + ones(M,l)*sum((Yc.^2)') - 2*Y*Yc'; 

% Assign each vector in Y to nearest centroid. 

[errvals,c] = min(d2'); 

% Adjust the centroids based on the new assignments, 
for (k= 1:K) 
if (sum(c==k)>0) 

Yc(k,:) = sum(Y(c==k,:))/sum(c==k); 
end 
end 

% Error value is the total squared distance from cluster centroids. 
errlog(i) = sum(errvals); 

fprintf(l,'... Iteration %4d — Error %I I.6f\n',i,errlog(i)); 
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% Test for termination, 
if (max(max(abs(Yo - Ye old))) < 10*eps) 
errlog = errlog(l:i); 
return 
end 
end 

% - 

% End of routine k means 

% - 

4. CODEBOOK GENERATION FOR VECTOR QUANTIZATION 

% Filename ; hmmeodebook.m 

% Written by ; Peter S.K. Hansen, IMM, Technical University of Denmark. 

% Date last revised ; 09/30/2000. 

% Modified by : R.S. Kurcan, LTJG. TU NAVY. 

% Date last modified : 12/18/2005. 

% Purpose ;This routine generates a codebook using the subroutine 

% k means.m. This codebook is used for vector quantization of the continuous-valued 
% speech feature vectors. 

function [cb,K,T,dist,index] = hmmcodebook(datadir,ntrial,Kmax,maxiter); 

% hmmeodebook.m —> Codebook generation for HMM recognizer. 

% 

% <Synopsis> 

% [cb,K,T,dist] = hmmcodebook(datadir,ntrial,Kmax,maxiter) 

% 

% <Inputs> 

% Kmax : Maximum cluster number for k means.m subfunction 
% maxiter : Maximum iteration number in k-means clustering algorithm 
% datadir : Directory of the speech data 

% ntrial ; Number of occurrences of each word to be used for 

% generating the codebook 

% 

% <Outputs> 

% cb : Codebook vectors (Locations of cluster centroids) 

% K ; Number of vector-quantized clusters 

% T : Number of observation vectors (rows) used to generate codebook. 

% dist: VQ distortion based on Euclidean distance 
% 

% <Description> 

% After user specifies the initial directory of the data folder, 

% this routine reads the data in from the cropped folder. Next it computes 
% the 12 MFCC parameters and 12 delta-MFCC parameters. Zero-order MFCC 
% coefficient is excluded from the features. Cepstral Mean Substraction is 
% applied to the static MFCC parameters. Dynamic MFCC parameters are 
% weighted by six to fit them into the range of static parameters. 
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% A feature matrix of 24 by T is computed where T shows the frame length. 

% The feature vectors for all sequences are concatenated and then vector 
% quantized (clustered) into K < Kmax feature vector (centroids). 

% The number of observation vectors used to generate the codebook is 
% returned in T, and the VQ average distortion is returned in dist. 

% 

% <References> 

% [1] J.R Deller, J.G. Proakis and F.H.L. Hansen, “Discrete-Time 
% Processing of Speech Signals”, IEEE Press, chapter 12, (2000). 

% - 

format long; 

if nargin<4 maxiter=1000; end 
if nargin<3 Kmax=128; end 
if nargin<2 ntrial=15; end 

ifnargin<l datadir='D;\MATLAB7\work\Data'; end 
cw = cd; 

indfolder = dir(datadir); 

% The structure array words.name excludes the words : 'right outside' and 
% 'kill_noise' 

words(l).name='down'; words(2).name='kiir; words(3).name='leff; 
words(4).name='move'; words(5).name='pan'; words(6).name='righf; 
words(7).name='up'; 

% Before feature extraction MECC parameters are set below, 
fs = 8000; % Sampling fireq. 

w ='Mtd'; % Hamming window in time domain, triangular shaped filters 

% in mel domain. All filters act in the absolute magnitude domain. Delta 
% parameters are computed as well. 

nc = 12; % 12 MECC plus 12 delta-MECC computed excluding O'th coef 

p = 24; % number of filters in critical-band filterbank (default 26). 

n = 256; % length of frame. 

inc = 156; % increments in the speech frames. 

coef = []; % feature matrix 

index = randperm(40); % Trial index splits the data into training set and test set. 
k = index(l mtrial); % Training index 

cntr = 1; 

for a = 3:length(ind_folder) 

parentdir = strcat(datadir,'\',ind_folder(a).name,'\Cropped\'); 
for g = 1:7 

wpath = strcat(parentdir,words(g).name); 

cd(wpath); 

wfolder = dir; 

for m = 1 mtrial 
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dpath = strcat(wpath,'\',wfolder(k(m)+2).name); 
s = load( dpath,'-ascii'); 

cd(cw); 

f = meicepst(s,fs,w,nc,p,n,inc); % returns each row as a frame 
% Cepstrai mean substraetion appiied to the i2 MFCC parameters, 
fi = f(:,i:i2); fi = fi - mean(fi(:)); 
f=[fi 6*f(:,i3:24)]; 

f=f; % Eaeh coiumn is set as a frame. 

fprmtf(i,'... Reading fde: %4d, %s, T = %4d\n',entr,... 

wfoider(k(m)+2).name,iength(f)); 


entr = entr+i; 

coef = [coef f]; % Feature matrix containing aii 

% training occurences of words. 
oiear('f,'fi','s'); 
end 
end 
end 


spath = 'D:\MATLAB7\work\Thesis\'; 

cd(spath); 

mkdir('Trammg'); 

% K-means seareh reads in rows of input as MFCC features and 
% coiumns of the data correspond to speeeh frames. 

[Ye,e,erriog] = k_means(coef,Kmax,maxiter); % Veetor quantization. 


ob= Yo(unique(c),:)'; 


% Oniy use eiusters with assignments. 


[N,K] = size(eb); % Codebook symboi size K. 

[N,T] = size(coef); % Veetor dim and number of vectors. 

dist = sqrt(erriog(end))/T; % VQ average distortion measure. 

savepath = strcat( spath,'Trammg\eb.txf); 

save(savepath, 'cb','-ascii'); 

savepath = streat(spath,'Training\triaiindex.txf); 

save(savepath, 'mdex','-ascii'); 

ed(ew); 


% - 

% End of routine hmmeodebook.m 
%- 
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APPENDIX C. HIDDEN MARKOV MODEL MATLAB PROGRAMS 


This appendix includes various MATLAB programs used for HMM-based 
recognition of the isolated words. Most of these MATLAB codes originated from those 
available at [Sorensen, 2005]. Either the original codes were used or they were modified 
and the source of code cited in the header of each program. 

1. HMM BAUM-WELCH RE-ESTIMATION 

% Filename ; hmmfh.m 

% Written by : Peter S.K. Hansen, IMM, Technical University of Denmark. 

% Date last revised ; 09/30/2000. 

% Purpose ; This subroutine implements Baum-Welch re-estimation algorithm 

% (or Forward-Backward algorithm) and is called in the routine “hmmtrain.m” for the 
% training of HMMs. 

% 

function [A,B,pil,loglike] = hmmfb(ytrain,S,K,maxiter,tol) 

% 

% <Synopsis> 

% [ A,B,pi 1,loglike] = hmmfb(ytrain,S,K,maxiter,tol) 

% 

% <Description> 

% The function implements the F-B reestimation algorithm used 
% to train a HMM based word classifier. The input arguments are 
% the F training sequences (same word), i.e., vectors in the 
% cell-array ytrain, the HMM parameters, S and K, which are the 
% number of states and the number of symbols (codebook vectors), 

% and maxiter and tol are stopping criteria in the F-B algorithm, 

% e.g., 5000 and le-3. 

% The output arguments are the S-by-S state transition matrix A, 

% the K-by-S observation probability matrix B, and the initial state 
% probability vector pil. Finally, the output vector loglike is 
% the log-likelihood as function of the iteration number in the 
% F-B reestimation algorithm. 

% 

% <References> 

% [1] J.R Deller, J.G. Proakis and F.H.F. Hansen, “Discrete-Time 
% Processing of Speech Signals”, IEEE Press, chapter 12, (2000). 

% - 

F = length(ytrain); % Number of training sequences. 

A = rand(S,S); A = A./repmat(sum(A),S,l); % Initial random model 
B = rand(K,S); B = B./repmat(sum(B),K,l); % with probability sum 
pil = rand(S,l); pil = pil/sum(pil); % normalized to one columnwise. 
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loglike = zeros(maxiter,l); 
loglikeold = -inf; 
loglikedif = inf; 


% Log-likelihood measures. 
% From previous iteration. 
% Differenee. 


k = 0; 

while (k < maxiter) & (loglike dif >= tol) 


k = k + 1; 

% Iteration eounter. 

pi e = 0; A_e = 0; B e = 0; 

% Initialize sum over 

loglikee = zeros(L,l); 

% sequenees. 

for (1= 1:L) 

y = ytrainjl}; 

% I'th sequence. 

T - length(y); 

% Length of I'th seq. 

alpha = zeros(S,T); 

% Forward recursion. 

beta = zeros(S,T); 

% Backward recursion. 

seale = zeros(T,l); 

% Scale factors. 

alpha(:,l) = piL*B(y(l),:)'; 

% Alpha for t=L 

seale(l) = sum(alpha(:,l)); 

% Scale factor for t=L 

alpha(:,l) = alpha(:,l)/seale(l); 
for (t = 2:T) 

alpha(:,t) = A*alpha(:,t-l).*B(y(t),:)'; 

% Scaled alpha for t=L 

soale(t) = sum(alpha(:,t)); 

% Scale factor for t. 

alpha(:,t) = alpha(:,t)/scale(t); 
end 

% Scaled alpha for t. 

beta(:,T) = l/seale(T); 

% Beta for t=T. 

for (t = (T-l):-l:l) 

% Scaled beta for t. 


beta(:,t) = A'*(beta(:,t+l).*B(y(t-i-l),:)')/seale(t); 
end 

loghke_e(l) = sum(loglO(scale)); % Log-likelihood, 

gammae = (alpha. *beta).*repmat(soale',S,l); 

A_o =A_o +(alpha(:,l;T-l)*(B(y(2:T),:)'.*beta(:,2:T))').*A'; 

Be = Be + (gamma_o*(repmat([l;K],T,l)==repmat(y(:),l,K)))'; 
pi_c = pi_c + gamma_o(:,l); 
end 

loghke(k) = sum(loghke_e); 
loghke_dif = loghke(k) - loglike old; 
loglikeold = loghke(k); 

pil = pi_o/sum(pi_o); 

A = A_o./repmat(sum(A_c),S,l); 

B = B_o./ repmat(sum(B_c) ,K, 1); 
fprmtf(l,'... Log likelihood: %6.3An',loghke(k)); 
end 
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loglike = loglike(2:k); 


% - 

% End of function hmmfb.m 

% - 

2. HMM VITERBI TRAINING ALGORITHM 

% Filename ; hmmviterbi.m 

% Written by : R.S. Kurcan, LTJG. TU NAVY. 

% Date last revised ; 11/28/2005. 

% Purpose ; Tbis subroutine implements tbe Viterbi Training Algorithm. 

% 

function [A,B, pil,loglike] = hmmviterbi(ytram,S,K,maxiter,tol) 

% 

% <Synopsis> 

% [A,B,loglike] = hmmviterbi(ytrain,S,K,maxiter,tol) 

% 

% <Description> 

% This function is loosely based on the file “hmmfb.m” written by Peter S.K. Hansen, 
% in an effort to follow similar notations. The function implements the Viterbi % 
algorithm used to train a HMM-based word classifier as an alternative to F-B % 
algorithm. The input arguments are the F training sequences of the same word, i.e., % 
vectors in the cell-array ytrain, the HMM parameters, S and K, which are the number % 
of states and the number of symbols (codebook vectors). Maxiter and tol are % 
stopping criteria in the Viterbi algorithm, e.g., 1000 and le-5. 

% The output arguments are the S-by-S state transition matrix A, the K-by-S 
% observation probability matrix B. The initial state probability vector pil is not 
% estimated assuming the initial state is state one in a left-to-right HMM. Finally, the 
% output vector logltke is the log-likelihood as function of the iteration number in the 
% Viterbi algorithm. 

% 

% <References> 

% [1] J.R Deller, J.G. Proakis and F.H.F. Hansen, “Discrete-Time 
% Processing of Speech Signals”, IEEE Press, chapter 12, (2000). 

% - 

F = length(ytrain); % Number of training sequences. 


A = rand(S,S); 

Au = tril(A); % Fower triangular part of A. 

A1 = triu(A, 1); % Upper triangular part of A. 

A = Au + eps*Al; % Upper tri. part set to 'eps' for a left-to-right HMM. 

A = A./repmat(sum(A),S,l); % Columns of A and B must sum 

B = rand(K,S); B = B./repmat(sum(B),K,l); % to unity. 

pil = [1; zeros(S-l,l)]; % a particular state designated as initial state. 


loglike = zeros(maxiter,l); % Fog-likelihood measures. 
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loglikeold = inf; 
loglikedif = inf; 


% From previous iteration. 
% Differenee. 


k = 0; 

while (k < maxiter) & (loglike dif >= tol) 
k = k + 1; % Iteration eounter. 

loglike = zeros(L,l); 

for (1= 1:L) 

y = ytrain{l}; % I'th sequenee. 

T = length(y); % Length of I'th seq. 

% Initialization 

Dmin = zeros(T,S); % Dmin(t,it) is distanee from (0,0) to (t,it) 

% over the best path. 

Dmin(I,:) = pi'.*B(y(I),:); % Dmin(I,i) at time t=I. 

gamma = zeros(T,S); % The best final state on the optimal partial 

% path ending at (t,it). 

% Reeursion 
for t = 2:T 
for i = I:S 

D = Dmin(t-I,:)+(-logI0(A(i,:)))+(-logI0(B(y(t),:))); 
[Dmin(t,i),gamma(t,i)] = min(D); 
end 
end 

% Termination 
bestseq = zeros(I,T); 

[mindist,argmin] = min(Dmin(T,:)); 
bestseq(T) = argmin; 

% Baektraeking 
fort = (T-I):-I:I 

bestseq(t) = gamma(t+I, bestseq(t+I)); % Best state sequenee. 
end 

loglike(l) = mindist; 
end 

loglike(k) = sum(loglike); 
loglike dif = loglike old - loglike(k); 
loglikeold = loglike(k); 

% Viterbi reestimation 
for i = I:S 

X = fmd(bestseq == i); 

nfromi(i) = length(X); % Number of transitions from state i. 
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end 

ntoj = nfromi; % Number of transitions to state j. 

nfromi(bestseq(end)) = nfromi(bestseq(end))-l; % First and last states 
ntoj'(bestseq(l)) = ntoj'(bestseq(l))-l % don't count transitions. 

nu_ji = zeros(S,S); % Number of transitions from state i to state j. 

for m =1:S-1 
i = bestseq(m); 
j = bestseq(m+l); 
nuJiOd) = nuji0,i)+l; 
end 

nu_kj = zeros(K,S); % Number of times observation k and state j occur jointly, 
for p = 1:S 
for t = 1;T 

nu_kj(y(t),bestseq(p)) = nu_kj(y(t),bestseq(p))+l; 
end 
end 

% Model parameters updated 
Am = nuji./repmat(nfromi,S,l); 

B_m = nu_kj./repmat(ntoj,K,l); 

fprintf(l,'... Log likelihood; %6.3An',loglike(k)); 

end 

loglike = loglike(2:k); 


% - 

% End of function hmmviterbi.m 

% - 

3. MULTIPLE-OBSERVATION HMM TRAINING ALGORITHM 

% Filename ; hmmtrain.m 

% Written by ; Peter S.K. Hansen, IMM, Technical University of Denmark. 

% Date last revised ; 09/30/2000. 

% Modified by ; R.S. Kurcan, LTJG. TU NAVY. 

% Date last modified : 01/05/2006. 

% Purpose ; This routine implements the multiple-observation training of 

% HMMs using the Baum-Welch re-estimation algorithm, (see also hmmfb.m) 

% 

function [A_m,B_m,pi_m,loghke_m,testidx] = hmmtrain(S,0,R,maxiter,tol,datadir); 

% 

% <Inputs> 

% S : Number of hidden states in HMMs 

% O : Number of observation sequences taken from each speaker for training of 

% the same word. 
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% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

%■ 


R 

maxiter 

tol 

datadir 


: Number of words to be reeognized 
: Max. iteration number to terminate training 
: Tolerance to terminate training 
: Directory of the cropped speech data 


<Outputs> 

A_m : Cell array for the state transition matrices of HMMs 

B_m : Cell array for the observation probability matrices of HMMs 

pi_m : Cell array for the state probability vectors of HMMs 

loglike m ; Cell array for the log likelihood of the observation 

sequences given the models. 


<Description> 

The function implements training of multiple HMMs. The training 
sequences are defined by the cell-array data. The length of the 
cell-array corresponds to the number of words, and each cell is a 
new cell array with filenames for occurrences of this spoken word. 
For each sequence, a frame based analysis is performed using the 
function y = melcepst(s,fs,w,nc,p,n,inc)to give observation 
vectors, which are then vector quantized into the possible 
codebook vectors given by cb. For each word, the (quantized) 
observation sequences are then used to train a HMM for this word, 
using the F-B reestimation algorithm. Here, S is the number of 
states in the HMM, and maxiter and tol are stopping criteria in 
the F-B algorithm, e.g., 1000 and le-5. The output probability 
measures for the HMMs are returned in the cell-arrays A m, B_m, 
pi rn, and loglike m. 


<References> 

[1] J.R Deller, J.G. Proakis and F.H.L. Hansen, “Discrete-Time 
Processing of Speech Signals”, IEEE Press, chapter 12, (2000). 


format long; 

% Default input parameterts 

if nargin < 6 datadir='C:\Program Piles\MATLAB71\work\Data'; end 

if nargin < 5 tol = le-5; end 

if nargin < 4 maxiter = 1000; end 

if nargin < 3 R = 7; end 

if nargin < 2 O = 25; end 

if nargin <1 S = 8; end 

cb = load('C:\Program Piles\MATLAB71\work\Training\cb.txf,'-ascii'); 

index = load('C:\Program Piles\MATLAB71\work\Training\trialindex.txf,'-ascii'); 
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% The structure array words.name includes the seven words to be recognized. 
words(l).name='up'; words(2).name='down'; words(3).name='left'; 
words(4).name='right'; words(5).name='kiir; words(6).name='pan'; 
words(7).name='move'; 

% MFCC parameters are set before feature extraction, 
fs = 8000; % Sampling freq. 

w ='Mtd'; % Hamming window in time domain, triangular shaped filters. 

% in mel domain. All filters act in the absolute magnitude domain. Delta 
% parameters are specified for computation as well. 

nc = 12; % 12 MFCC plusl2 delta-MFCC computed excluding O'th coef 

p = 24; % number of filters in critical-band filterbank (default 26). 

n = 256; % length of frame. 

inc = 156; % increments in the speech frames. 

coef = []; % feature matrix with each column representing a frame. 

[dim,K] = size(cb); % Get symbol size K from the codebook. 

A m =cell(R,l); % Output arguments in cell-arrays. 

B_m = cell(R,l); 
pirn = cell(R,l); 
loglikem = cell(R,l); 

cw = cd; 

indfolder = dir(datadir); 

L = (length(ind_folder)-2)*0; % Total number of occurrences of i'th word, 
trnidx = index(l :0); % Random trial indeces for training, 

testidx = index(0+l :end); % Random trial indeces for testing, 

cntr = 1; 

for i = 1 ;R % Loop words. 

ytrain = cell(L,l); % Generate cell-array for train seq. 

j = i; 

for a = 3:length(ind_folder) 

parentdir = strcat(datadir,'V,ind_folder(a).name,'\CroppedV, words(i).name); 

cd(parentdir); 

wfolder = dir; 

for m = 1:0 

s = load(wfolder(trnidx(m)-l-2).name,'-ascii'); % Load word file 
cd(cw); 

y = melcepst(s,fs,w,nc,p,n,inc); % Extract MFCC feature vectors. 

% Cepstral mean-substraction applied to the 12 MFCC parameters, 
yl = y(:,l:12); yl = yl - mean(yl(:)); 
y = [yl 6*y(:,13:24)]; 
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y = y'; % Each column is set as feature vector of a frame. 

[dim,T] = size(y); % Number of veetors T. 
yk = zeros(T,l); 

fprmtf(l,'... Reading fde; %4d, %s, T = 4d\n',entr,wfolder(trnidx(m)+2).name,T); 
entr = entr + 1; 

for (t = 1 :T) % Vector quantization. 

[dist,k] = mm(sum((cb-repmat(y(:,t),l,K)).^2)); 
yk(t) = k; % Symbol at time/frame t. 

end; 

ytrainO) = {yk}; 

j =j+i; 

elear('yVyr,'s','yk','k'); 

ed(parentdir); 

end 

end 

ed(ew); 

[A,B,pil,loglike] = hmmfb(ytram,S,K,maxiter,tol); 

A_m(i) = {A}; 

B_m(i) = {B}; 
pi_m(i)= {pil}; 
loglike_m(i) = {loglike}; 


% - 

% End of funetion hmmtrain.m 

% - 

4. HMM LOGLIKELIHOOD COMPUTATION ALGORITHM 

% Eilename : loglike.m 

% Written by : Peter S.K. Hansen, IMM, Teehnieal University of Denmark. 

% Date last revised : 09/30/2000. 

% Purpose : This routine eomputes the log-likelihood for a given 

% observation sequenee y eonsisting of veetor quantized speeeh feature 
% veetors by using Hidden Markov Model parameters. 

funetion loglike = hmmlogp(y,A,B,pil) 

% hmmlogp —> Eog-likelihood for given observation sequenee and HMM. 

% 

% <Deseription> 

% The function calculates the log-ltkelihood for a given observation sequence y, 

% and Hidden Markov Model A, B, and pil. Here, A is the S-by-S state transition 
% matrix, B is the K-by-S observation probability matrix, and pil is the initial state 
% probability veetor. 

% 
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% <References> 

% [1] J.R Deller, J.G. Proakis and F.H.L. Hansen, “Discrete-Time 
% Processing of Speech Signals”, IEEE Press, chapter 12, (2000). 
%- 


format long; 

T = length(y); % Eength of observation sequence. 

S = length(A); % Number of hidden states. 


alpha = zeros(S,T); 
beta = zeros(S,T); 
scale = zeros(T,l); 


% Eorward recursion. 
% Backward recursion. 
% Scale factors. 


alpha(:,l) = pil.*B(y(l),:)' + eps; % Alpha for t=l. 
scale(l) = sum(alpha(:,l)); % Scale factor for t=l. 

alpha(:,l) = alpha(:,l)/scale(l); % Scaled alpha for t=l. 
for (t = 2;T) 

alpha(:,t) = A*alpha(:,t-l).*B(y(t),:)'; 
scale(t) = sum(alpha(:,t)); % Scale factor for t. 

alpha(:,t) = alpha(:,t)/(scale(t) + eps); % Scaled alpha for t. 
end 


loglike = sum(loglO(scale + eps)); 

% - 

% End of function hmmlogp 

% - 

5. HMM RECOGNITION ALGORITHM 

% Eilename : hmmrecog.m 

% Written by : Peter S.K. Hansen, IMM, Technical University of Denmark. 
% Date last revised ; 09/30/2000. 

% Modified by : R.S. Kurcan, ETJG. TU NAVY. 

% Date last modified : 02/10/2006. 

% Purpose : This routine implements HMM recognition for a given 

% observation sequence based on the trained input model parameters. 

% 

function [logp,guess] = hmmrecog(A_m,B_m,pi_m,testidx); 

% hmmrecog —> HMM based word classifier. 

% 

% <Description> 

% The function implements HMM based recognition, where the cell-array 
% data contains filenames for the spoken words to be recognized. 

% for each word, a frame based analysis is performed using the 
% function y = melcepst(s,fs,w,nc,p,n,inc) to give observation 
% vectors, which are then vector quantized into the possible 
% codebook vectors stored in the file cb.txt. The resulting observation 
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% sequence and the probability measures for all the HMMs, given by the 
% cell-arrays A m, B_m, and pi rn, are then used to calculate the 
% log-likelihoods for the HMMs (column of logp). The input “testidx” 

% contains the randomly chosen trial index for the testing set. The word 
% associated with the HMM of highest log-likelihood is declared 
% to be the recognized word, and the index is returned in guess. 

% This procedure is repeated for all the words in data. 

% 

% <References> 

% [1] J.R Deller, J.G. Proakis and F.H.L. Hansen, “Discrete-Time 
% Processing of Speech Signals”, IEEE Press, chapter 12, (2000). 

% - 


format long; 

datadir='C:\Program Piles\MATLAB7 l\work\Data'; 

cb = load('C:\Program Piles\MATLAB71\work\Training\cb.txf,'-ascii'); 

%index = load('C:\Program Piles\MATLAB71\work\Training\trialindex.txf,'-ascii'); 

% The structure array words.name includes the seven words to be recognized. 
words(l).name='up'; words(2).name='down'; words(3).name='leff; 
words(4).name='righf; words(5).name='kiir; words(6).name='pan'; 
words(7).name='move'; 


% Before feature 
fs = 8000; % 

w ='Mtd'; % 

% in mel domain, 
% parameters are 


nc = 12; 

P = 24; 
n = 256; 
inc = 156; 
coef = []; 


% 

% 

% 

% 

% 


extraction MECC parameters are set below. 

Sampling freq. 

Hamming window in time domain, triangular shaped filters. 
All filters act in the absolute magnitude domain. Delta 
specified for computation as well. 

12 MECC plus 12 delta-MECC computed excluding O'th coef 
number of filters in critical-band filterbank (default 26). 
length of frame. 

increments in the speech frames. 

feature matrix with each column representing a frame. 


[dim,K] = size(cb); % # of observation symbols. 

R = length(A_m); % # of trained words (7 words). 


O = length( testidx); 


% # of test words from each speaker per word. 


cw = cd; 

indfolder = dir(datadir); 

E = (length(ind_folder)-2)*0; % Total # of ith words to be recognized/tested, 
logp = zeros(R,E); 

V = 1 ; cntr = 1 ; 
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for i = 1 :R 


% Loop words. 


for a = 3:length(ind_folder) 

parentdir = strcat(datadir,'V,ind_folder(a).name,'\CroppedV, words(i).name); 

cd(parentdir); 

wfolder = dir; 

for m = 1:0 

% Load test word signal from file, 
s = load(wfolder(testidx(m)+2).name,'-ascii'); 

cd(cw); 

y = melcepst(s,fs,w,nc,p,n,mc); % Extract 12 feature vectors. 

% Cepstral mean-substraction applied to the 6 MFCC parameters, 
yl = y(:,l:12); yl = yl - mean(yl(:)); 
y = [yl 6*y(:,13:24)]; 

y = y'; 

[dim,T] = size(y); % Number of vectors T. 

fprintf(l,'... Reading file:%4d, %s, T %4d\n',cntr,wfolder(testidx(m)-l-2).name,T); 
cntr = cntr + 1; 
yk = zeros(T,l); 

% Vector quantization, 
for (t= 1:T) 

[dist,k] = min(sum((cb-repmat(y(:,t),l,K)).^2)); 
yk(t) = k; 
end 

% Scoring 

for 0 = 1 :R) % Estimate log-likelihood for each model. 

Iogp0,v) = hmmlogp(yk,A_m{j},B_m{j},pi_m{j}); 
end 

V = v+1; 

clear('y','yr,'s','yk','k'); 

cd(parentdir); 

end 

end 

end 

cd(cw); 

[loglike,guess] = max(logp); 

% - 

% End of function hmmrecog.m 

% - 
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APPENDIX D. HMM CLASSIFICATION RESULTS FOR 
MULTIPLE EXPERIMENTS 


% Filename : hmmresult.m 

% Written by ; R.S. Kurcan, LTJG. TU NAVY. 

% Date last revised : 02/15/2005. 

% Purpose : This subroutine estimates eonfusion matriees and 

% eomputes 95 % eonfidenee intervals for N experiments 
% 

% hmmresult —> HMM based IWR elassification results and eonfusion matriees 
% 

% <Description> 

% The funetion implements estimates eonfusion matriees and eomputes 
% 95 % eonfidenee intervals for N experiments where N is the number 

% of multiple experiments. N is ehosen sueh that (N*0.05/2) should be 
% integer for 95 % eonfidenee interval eomputation. 

% - 

elear all; 

N = 80; % # of experiments 

eonf = eell(N,l); 
rate = eell(N,l); 

R = 7; 


for i=l:N 

[ob,K,T,dist,index] = hmmeodebook; 
[A_m,B_m,pi_m,loglike_m,testidx] = hmmtrain; 

[logp,guess] = hmmreeog(A_m,B_ni,pi_m,testidx); 

O =length(testidx); 
elassrate = zeros(R,R); 
for h = 1 ;R; 
for k = 1 :R; 
elassnum = 0; 
n = (O*(k-l)*20)+l; 
m = O*k*20; 
for 1 = n;m 

if guess(l) == h; elassnum = elassnum + 1; end 
end 

elassrate(k,h) = elassnum/(O*20); 
end 
end 

elassrate = elassrate* 100; 
overallrate = sum(diag(elassrate))/R; 
format short; 
oonf(i) = {elassrate}; 
rate(i) = {overallrate}; 
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clear('cbVKVTVdistVindexVA_mVB_mVpi_mVloglike_m',... 

'testidxVlogpVguessVguessVclassnumVclassrateVoverallrate'); 


rmdir('C:\Program Files\MATLAB71\work\Training','s'); 


end 


mkdir('H:\Docs\Thesis_Work\Confusion'); 

savepathl = 'H:\Docs\Thesis_Work\Confusion\conf.mat'; 

save(savepathl,'conf,'-mat'); % Confusion matrices for N exp. are stored in cell array. 
savepath2 = 'H:\Docs\Thesis_Work\Confusion\rate.mat'; 

save(savepath2,'rate','-mat'); % Class, rates for each of N exp. are stored in cell array. 
% 95% Confidence intervals computation and average of N exps. classification results, 
confmt = zeros(8,2); 

b = []; sum_conf = zeros(7,7); d = zeros(80,l); 
for i = 1 ;N 

a = diag(conf{i})'; % Diag. elements of confusion matrices set as rows, 

b = [b ; a]; 

sumconf = sumconf + confji}; 
d(i) = rate{i}; 
end 


c = sort(b); % Column s of the matrix b is sorted in ascending order, 

e = sort(d); 

confmt(l:7,l) = c(l+(N*0.05/2),:); % min in 95 % confidence interval. 

confmt(l:7,2) = c(N-(N*0.05/2),:); % max in 95 % confidence interval. 

conlint(8,l) = e(l+(N*0.05/2),l); % min in 95 % conf. interval for avg classification. 

conlint(8,2) = e(N-(N*0.05/2),l); % max in 95 % conf. interval for avg classification. 

classification = mean(c); % Mean of N exp. classification result. 

mean_conf = sum_conf / N; % Mean of N exp. confusion matrix. 

savepathS = 'H:\Docs\Thesis_Work\Confusion\confint.mat'; 

save(savepath3,'confint','-mat'); % 95% confidence intervals are stored in array. 

savepathd = 'H:\Docs\Thesis_Work\Confusion\classification.mat'; 

save(savepath4,'classification','-mat'); % Average classification rates for all words 

% out of N exp. are stored in array. 

savepath5 = 'H:\Docs\Thesis_Work\Confusion\mean_conf.mat'; 
save(savepath5,'mean_conf ,'-mat'); % Average confusion matrix for N exp. is 

% stored in array. 

% 

% - 

% End of function hmmresult.m 

% - 
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